How to extract image URLs using bash?

I would like to extract the image URLs from a page's HTML source using bash commands and then download all the images from that page. I am not sure whether this is possible, as sometimes the images are stored in folders I wouldn't have access to.
But is it possible to download them from the source code?
I have written this so far:
wget -O plik.txt $1
grep *.jpg plik.txt > wget
grep *.png plik.txt > wget
grep *.gif plik.txt > wget
rm plik.txt

Using lynx (a text web browser) in non-interactive mode, and GNU xargs:
#!/bin/bash
lynx -dump -listonly -image_links -nonumbers "$1" |
grep -Ei '\.(jpg|png|gif)$' |
tr '\n' '\000' |
xargs -0 -- wget --no-verbose --
This will immediately start downloading the matching image URLs found in the page whose URL is given in $1.
It includes both images embedded in the page and images that are linked to. Removing -image_links will skip the images embedded in the page, leaving only the linked ones.
You can add or remove whichever extensions you want to download, following the pattern shown for .jpg, .png, and .gif (the -i makes grep case-insensitive).
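For example, to also match .jpeg and .webp (the extra extensions here are just illustrative), the filter would become:
grep -Ei '\.(jpg|jpeg|png|gif|webp)$'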
The reason for converting to null delimiters (via tr) is so that xargs -0 can be used, which avoids problems with URLs that contain a single quote/apostrophe (').
The --no-verbose flag for wget just simplifies the log output; I find it easier to read when downloading a large list of files.
Note that regular GNU wget will handle any duplicate filenames by appending a number (foo.jpg.1 etc.). However, busybox wget, for example, just exits if a filename already exists, abandoning further downloads.
You can also modify the xargs command to just print the list of files that would be downloaded, so you can review it first: xargs -0 -- sh -c 'printf "%s\n" "$@"' _
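Put together, a dry-run version of the whole script (the same commands as above, printing instead of downloading) would look something like:
#!/bin/bash
# Print the image URLs that would be downloaded, without fetching anything.
lynx -dump -listonly -image_links -nonumbers "$1" |
grep -Ei '\.(jpg|png|gif)$' |
tr '\n' '\000' |
xargs -0 -- sh -c 'printf "%s\n" "$@"' _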

Related

Copy files that have at least the mention of one certain word

I want to look through 100K+ text files in a directory and copy to another directory only the ones which contain at least one word from a list.
I tried doing an if statement with grep and cp, but I have no idea how to make it work this way.
for filename in *.txt
do
grep -o -i "cultiv" "protec" "agricult" $filename|wc -w
if [ wc -gt 0 ]
then cp $filename ~/desktop/filepath
fi
done
Obviously this does not work, but I have no idea how to store the wc result, compare it to 0, and act only on the matching files.
Use the -l option to have grep print all the filenames that match the pattern. Then use xargs to pass these as arguments to cp.
grep -l -E -i 'cultiv|protec|agricult' *.txt | xargs cp -t ~/desktop/filepath --
The -t option is a GNU cp extension; it lets you put the destination directory first, so that the command works with xargs.
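To illustrate with two made-up filenames, these two commands are equivalent; -t just moves the destination out of the final position, which is where xargs appends the arguments:
cp file1.txt file2.txt ~/desktop/filepath      # standard form: destination last
cp -t ~/desktop/filepath file1.txt file2.txt   # GNU -t form: destination first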
If you're using a version without that option, you need to use the -J option to xargs to substitute in the middle of the command.
grep -l -E -i 'cultiv|protec|agricult' *.txt | xargs -J {} cp -- {} ~/desktop/filepath

How to search and replace with egrep and sed on macOS?

I want to match a pattern in a file and replace it.
This command works with egrep, xargs and sed:
egrep -lRZ "hello" . | xargs -0 -l sed -i -e 's/hello/world/g'
The problem: it does not work on macOS, because the macOS xargs does not support the -l argument.
xargs: illegal option -- l
usage: xargs [-0opt] [-E eofstr] [-I replstr [-R replacements]] [-J replstr]
[-L number] [-n number [-x]] [-P maxprocs] [-s size]
[utility [argument ...]]
How can this be solved on macOS?
There are actually three incompatibilities you're going to run into here between the GNU (Linux) and BSD (macOS) utilities.
The one you're getting an error message from is that BSD's xargs doesn't accept the -l option. But -l is equivalent to -L, except that -L requires an argument specifying the maximum number of lines to pass per invocation of the command, while -l defaults to one if no argument is given. Thus, you can just replace -l with -L1. -L is understood the same way by both the GNU and BSD versions of xargs, so using it is portable between Linux and macOS.
But in this particular case, there's an even easier option: sed is perfectly capable of operating on multiple files per invocation, so there's no reason to limit it to one file at a time. This will even be slightly faster, since it doesn't have to spend as much time launching new processes. So just leave -l off.
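Still assuming GNU tools for the moment, dropping -l from your original command would give:
egrep -lRZ "hello" . | xargs -0 sed -i -e 's/hello/world/g'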
The GNU and BSD versions of egrep (and the rest of the grep family) both take the option -Z, but they use it to mean completely different things. With GNU, egrep -Z prints a zero byte (ASCII NUL) after each filename, matching what xargs -0 expects. But with BSD, egrep -Z is equivalent to zgrep: it treats its input files as compressed files and expands them before searching their contents.
Fortunately, both versions understand --null to request zero-byte delimiters, so you can use that portably on both platforms.
Both the GNU and BSD versions understand -i<suffix> to mean "edit in place, backing up the original with the specified filename suffix", and for both, a zero-length suffix means no backup is kept. Unfortunately, the way you specify a zero-length suffix differs and is (as far as I've been able to find) irreconcilably incompatible. GNU requires the suffix to be attached directly to the -i (e.g. -i.bkp), so specifying -i by itself is enough to get in-place-without-backup mode. BSD allows the suffix to be passed as a separate argument (e.g. -i .bkp), so if you specify -i by itself, it will consume whatever the next argument is as the suffix (e.g. sed -i -e 's/hello/world/g' will use "-e" as the suffix). To get in-place-without-backup mode on BSD, you need to follow -i with an explicit empty argument (sed -i '' -e 's/hello/world/g'). But if you do that with GNU sed, it will try to execute the empty argument as its script, which will fail.
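To illustrate the -i difference on a single file (foo.txt is just an example name):
sed -i.bkp -e 's/hello/world/g' foo.txt   # GNU and BSD: in place, backup kept as foo.txt.bkp
sed -i -e 's/hello/world/g' foo.txt       # GNU only: in place, no backup
sed -i '' -e 's/hello/world/g' foo.txt    # BSD only: in place, no backup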
With all that, here's the macOS version of your command:
egrep -lR --null "hello" . | xargs -0 sed -i '' -e 's/hello/world/g'
...which will almost work on Linux; the only difference is that you'd need to remove the '' argument to sed. If you want something fully portable between Linux and macOS, specify a backup suffix and attach it directly to the -i option, as in -i.bkp.
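In other words, something like this should run unchanged on both platforms, at the cost of leaving .bkp files behind:
egrep -lR --null "hello" . | xargs -0 sed -i.bkp -e 's/hello/world/g'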
The grep options to recursively search for files are best avoided: they just clutter up your grep arguments and make your scripts non-portable. There's already a perfectly good tool designed to find files, with a very obvious name.
Are you just trying to replace hello with world in all your files? If so, that's just:
find . -type f |
while IFS= read -r file; do
sed 's/hello/world/g' "$file" > "tmp$$" &&
mv "tmp$$" "$file"
done
That'll work in any shell on any UNIX box, unless your file names contain newlines. If you don't want to change timestamps etc. on files that don't contain hello, one way is:
find . -type f -exec grep -q 'hello' {} \; -print |
while IFS= read -r file; do
sed 's/hello/world/g' "$file" > "tmp$$" &&
mv "tmp$$" "$file"
done

How to copy files found with grep on OSX

I want to copy files I've found with grep on an OSX system, where the cp command doesn't have a -t option.
A previous post's solution for doing something like this relied on the -t flag in cp. However, like that poster, I want to take the file list I receive from grep and then execute a command over it, something like:
grep -lr "foo" --include=*.txt * 2>/dev/null | xargs cp -t /path/to/targetdir
Less efficient than cp -t, but this works:
grep -lr "foo" --include=*.txt * 2>/dev/null |
xargs -I{} cp "{}" /path/to/targetdir
Explanation:
For filenames | xargs cp -t destination, xargs changes the incoming filenames into this format:
cp -t destination filename1 ... filenameN
i.e., it only runs cp once (actually, once for every few thousand filenames; xargs splits the command line if it would exceed the system's argument-length limit).
For filenames | xargs -I{} cp "{}" destination, on the other hand, xargs changes the incoming filenames into this format:
cp "filename1" destination
...
cp "filenameN" destination
i.e., it runs cp once for each incoming filename, which is much slower. For a large number (e.g., more than 10,000) of very small (e.g., under 10 kB) files, I'd guess it could even be thousands of times slower. But it does work :)
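You can see the difference with echo standing in for cp (the filenames are just examples):
printf '%s\n' a.txt b.txt c.txt | xargs echo cp -t dest
# prints: cp -t dest a.txt b.txt c.txt       (one invocation)
printf '%s\n' a.txt b.txt c.txt | xargs -I{} echo cp {} dest
# prints: cp a.txt dest
#         cp b.txt dest
#         cp c.txt dest                      (one invocation per file)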
PS: Another popular technique is to use find's -exec action instead of xargs, e.g., https://stackoverflow.com/a/5241677/1563960
Yet another option, if you have admin privileges or can persuade your sysadmin, is to install the coreutils package as suggested here, and follow the steps but for cp rather than ls.

Scaling up grep find and copy to large folder (xargs?)

I would like to search a directory for any file that matches any of a list of words. If a file matches, I would like to copy that file into a new directory. I created a small batch of test files and got the following code working:
cp `grep -lir 'word\|word2\|word3\|word4\|word5' '/Users/originallocation'` '/Users/newlocation'
Unfortunately, when I run this code on a large folder with a few thousand files, it says the argument list is too long for cp. I think I need to loop over the results or use xargs, but I can't figure out how to make the conversion.
The minimal change from what you have would be:
grep -lir 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs cp -t '/Users/newlocation'
But don't use that: you never know when you'll encounter a filename with spaces or newlines in it, so null-terminated strings should be used. On Linux/GNU, add the -Z option to grep and -0 to xargs:
grep -Zlir 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs -0 cp -t '/Users/newlocation'
On Macs (and AIX, HP-UX, Solaris, *BSD), the grep options change slightly but, more importantly, the GNU cp -t option is not available. A workaround is:
grep -lir --null 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs -0 -I fname cp fname '/Users/newlocation'
This is less efficient because a new instance of cp has to be run for each file to be copied.
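On macOS and the other BSDs, you can get the single-invocation behaviour back with the -J option to xargs (paths here match the earlier example); a sketch:
grep -lir --null 'word\|word2\|word3\|word4\|word5' '/Users/originallocation' | \
xargs -0 -J fname cp fname '/Users/newlocation'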
An alternative solution for those without grep -r, using find + egrep + xargs; note that it assumes no two files share the same name in different folders. It also replaces the ugly word\|word2\|word3\|word4\|word5 style with egrep's plain alternation:
find . -type f -exec egrep -l 'word|word2|word3|word4|word5' {} \; | xargs -I {} cp {} /LARGE_FOLDER

Best way to do a find/replace in several files?

What's the best way to do this? I'm no command-line warrior, but I was thinking there's possibly a way of using grep and cat.
I just want to replace a string that occurs in a folder and its sub-folders. I'm running Ubuntu, if that matters.
I'll throw in another example for folks using ag, The Silver Searcher, to do find/replace operations on multiple files.
Complete example:
ag -l "search string" | xargs sed -i '' -e 's/from/to/g'
If we break this down, what we get is:
# returns a list of files containing matching string
ag -l "search string"
Next, we have:
# consume the piped list of files and prepare to run the following
# command once for each newline-delimited file
xargs
Finally, the string replacement command:
# -i '' means edit files in place and the '' means do not create a backup
# -e 's/from/to/g' specifies the command to run, in this case,
# global, search and replace
sed -i '' -e 's/from/to/g'
find . -type f -print0 | xargs -0 -n 1 sed -i -e 's/from/to/g'
The first part of that is a find command to locate the files you want to change; you may need to modify it appropriately. The xargs command takes every file that find found and applies the sed command to it. The sed command replaces every instance of from with to. That's a standard regular expression, so modify it as you need.
If you are using svn, beware: your .svn directories will be searched and replaced as well. You have to exclude those, e.g. like this:
find . ! -regex ".*[/]\.svn[/]?.*" -type f -print0 | xargs -0 -n 1 sed -i -e 's/from/to/g'
or
find . -name .svn -prune -o -type f -print0 | xargs -0 -n 1 sed -i -e 's/from/to/g'
As Paul said, you want to first find the files you want to edit and then edit them. An alternative to using find is to use GNU grep (the default on Ubuntu), e.g.:
grep -r -l -Z from . | xargs -0 -n 1 sed -i -e 's/from/to/g'
You can also use ack-grep (sudo apt-get install ack-grep, or visit http://petdance.com/ack/) if you know you only want a certain type of file and want to ignore version-control directories. E.g., if you only want text files:
ack -l --print0 --text from | xargs -0 -n 1 sed -i -e 's/from/to/g'
# `from` here is an arbitrary commonly occurring keyword
An alternative to using sed is to use perl, which can process multiple files per invocation, e.g.:
grep -r -l from . | xargs perl -pi.bak -e 's/from/to/g'
Here, perl is told to edit in place, making a .bak file first.
You can combine any of the left-hand sides of the pipe with the right-hand sides, depending on your preference.
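For example, the ack side feeding the perl side:
ack -l --print0 --text from | xargs -0 perl -pi.bak -e 's/from/to/g'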
An alternative to sed is using rpl (e.g. available from http://rpl.sourceforge.net/ or your GNU/Linux distribution), like rpl --recursive --verbose --whole-words 'F' 'A' grades/
For convenience, I took Ulysse's answer (after correcting the undesirable error printing) and turned it into a .zshrc / .bashrc function:
function find-and-replace() {
ag -l "$1" | xargs sed -i -e s/"$1"/"$2"/g
}
Usage: find-and-replace Foo Bar
The typical (find|grep|ack|ag|rg)-xargs-sed combination has a few problems:
Difficult to remember and get correct. E.g., forgetting the xargs -r option will run the command even when no files are found, potentially causing problems.
Retrieving the file list and doing the actual replacement use different CLI tools, which can have different search behaviour.
These problems were big enough, for an operation as invasive and dangerous as recursive search-and-replace, to prompt the development of a dedicated tool: mo.
Early tests suggest that its performance is between ag and rg, and that it solves the following problems I encounter with them:
A single invocation can filter on filename and content. The following command searches for the word bug in all source files that have a v1 indication:
mo -f 'src/.*v1.*' -p bug -w
Once the search results look OK, the actual replacement of bug with fix can be added:
mo -f 'src/.*v1.*' -p bug -w -r fix
comment() {
  : # no-op: exists only so inline comments can be written as commands
}
doc() {
  : # no-op: exists only to hold usage documentation
}
function agr {
  doc 'usage: from=sth to=another agr [ag-args]'
  comment -l --files-with-matches
  ag -0 -l "$from" "$@" | pre-files "$from" "$to"
}
pre-files() {
  doc 'stdin should be a null-separated list of files that need replacement; $1 the string to replace, $2 the replacement.'
  comment '-i backs up original input files with the supplied extension (leave empty for no backup; needed for in-place replacement.)(do not put whitespace between -i and its arg.)'
  comment '-r, --no-run-if-empty
  If the standard input does not contain any nonblanks,
  do not run the command. Normally, the command is run
  once even if there is no input. This option is a GNU
  extension.'
  AGR_FROM="$1" AGR_TO="$2" xargs -r0 perl -pi.pbak -e 's/$ENV{AGR_FROM}/$ENV{AGR_TO}/g'
}
You can use it like this:
from=str1 to=sth agr path1 path2 ...
Supply no paths to make it use the current directory.
Note that ag, xargs, and perl need to be installed and on PATH.
