I need to find (preferably) or build an app for a lot of images.
Each image has a distinct URL. There are many thousands, so doing it manually is a huge effort.
The list is currently in a CSV file. It is essentially a list of products, each with identifying info (name, brand, barcode, etc.) and a link to a product image.
I'd like to loop through the list, and download each image file. Ideally I'd like to rename each one - something like barcode.jpg.
I've looked at a number of image scrapers, but haven't found one that works quite this way.
Very appreciative of any leads to the right tool, or ideas...
Are you on Windows or Mac/Linux? On Windows you can use a PowerShell script for this; on Mac/Linux, a shell script of about 1-5 lines of code.
Here's one way to do this:
# show what's inside the file
cat urlsofproducts.csv
http://bit.ly/noexist/obj101.jpg, screwdriver, blackndecker
http://bit.ly/noexist/obj102.jpg, screwdriver, acme
# this one-liner will GENERATE one download-command per item, but will not execute them
perl -MFile::Basename -F", " -anlE "say qq(wget -q \$F[0] -O '\$F[1]--\$F[2]--). basename(\$F[0]) .q(')" urlsofproducts.csv
# Output :
wget -q http://bit.ly/noexist/obj101.jpg -O 'screwdriver--blackndecker--obj101.jpg'
wget -q http://bit.ly/noexist/obj102.jpg -O 'screwdriver--acme--obj102.jpg'
Now feed the generated wget commands back into the shell, for example by appending | sh to the one-liner.
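For the original goal of saving each image as barcode.jpg, the same generate-then-run idea can be sketched in plain shell. This is only a sketch under assumptions the thread does not state: the CSV has no header row, no quoted fields containing commas, and its columns are name, brand, barcode, URL in that order. The function name plan_downloads and the file name products.csv are made up for illustration.

```shell
#!/bin/sh
# plan_downloads: read CSV rows (name,brand,barcode,url) on stdin and print
# one wget command per row, without running anything.
# ASSUMPTION: this column order and no quoted commas; adjust to the real file.
plan_downloads() {
    while IFS=, read -r name brand barcode url; do
        # Strip spaces and any trailing carriage return from the two fields used.
        barcode=$(printf '%s' "$barcode" | tr -d ' \r')
        url=$(printf '%s' "$url" | tr -d ' \r')
        # Skip blank or incomplete rows.
        [ -n "$barcode" ] && [ -n "$url" ] || continue
        printf "wget -q '%s' -O '%s.jpg'\n" "$url" "$barcode"
    done
}
```

Running plan_downloads < products.csv only prints the commands so they can be inspected first; plan_downloads < products.csv | sh executes them. URLs that contain a single quote would break the quoting and need separate handling.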
If possible, consider using Google Sheets to run a function for this kind of work. I was also puzzled by this problem and have now found a way in which the images are not only downloaded but also renamed at the same time.
Please reply if you want the code.
I'm wanting to progress through a directory's subdirectories and either convert or place .TIF images into a pdf. I have a directory structure like this:
folder
    item_one
        file1.TIF
        file2.TIF
        ...
        fileN.TIF
    item_two
        file1.TIF
        file2.TIF
        ...
    ...
I'm working on a Mac and considered using sips to change my .TIF files to .PNG files and then use pdfjoin to join all the .PNG files into a single .PDF file per folder.
I have used:
for filename in *; do sips -s format png "$filename" --out "$filename.png"; done
but this only works for the .TIF files in a single directory. How would one write a shellscript to progress through a series of directories as well?
Once the .PNG files were created, I'd do essentially the same thing, but using:
pdfjoin --a4paper --fitpaper false --rotateoversize false *.png
Is this a valid way of doing this? Is there a better, more efficient way of performing such an action? Or am I being an idiot and should be doing this with some sort of software, like ImageMagick or something?
Try using the find command with the exec switch to call your image conversion solution. Alternatively, instead of using the exec switch, you could pipe the output of find to xargs. There is lots of information online about using find. Here's one example from StackOverflow.
As far as the image conversion, I think that really depends on your requirements for speed and efficiency. If you've verified the process you described, and this is a one-time process, and it only takes seconds or minutes to run, then you're probably fine. On the other hand, if you need to do this frequently, then it might be worth investing the time to find a one-step conversion solution that takes less time than your current, two-pass solution.
Note that, instead of two passes, you may be able to pipe the output of sips to pdfjoin; however, that would require some investigation to verify.
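The find-based approach suggested above could look roughly like the following sketch. The function names tif_dirs and convert_tree are hypothetical, and the sips and pdfjoin invocations are simply the ones from the question, untested here. It assumes directory names contain no newlines.

```shell
#!/bin/sh
# tif_dirs ROOT: print every directory under ROOT that directly contains
# at least one .TIF file (matched case-insensitively), one per line.
tif_dirs() {
    find "$1" -type f -iname '*.tif' -exec dirname {} \; | sort -u
}

# convert_tree ROOT: in each such directory, convert the TIFFs to PNG with
# sips and join the PNGs with pdfjoin, using the commands from the question.
# ASSUMPTION: sips (macOS) and pdfjoin are installed; nothing calls this here.
convert_tree() {
    tif_dirs "$1" | while IFS= read -r dir; do
        (
            cd "$dir" || exit 1
            for f in *.[Tt][Ii][Ff]; do
                sips -s format png "$f" --out "${f%.*}.png"
            done
            pdfjoin --a4paper --fitpaper false --rotateoversize false *.png
        )
    done
}
```

Calling tif_dirs folder first shows which directories would be processed; convert_tree folder then does the work one subdirectory at a time, in a subshell so the cd does not leak into the rest of the script.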
Wget has the -H "span host" option
Span to any host—‘-H’
The ‘-H’ option turns on host spanning, thus allowing Wget's recursive run to visit any host referenced by a link. Unless sufficient recursion-limiting criteria are applied depth, these foreign hosts will typically link to yet more hosts, and so on until Wget ends up sucking up much more data than you have intended.
I want to do a recursive download (say, of level 3), and I want to get images, stylesheets, javascripts, etc. (that is, files necessary to display the page properly) even if they're outside my host. However, I don't want to follow a link to another HTML page (because then it can go to another HTML page, and so on, then the number can explode.)
Is it possible to do this somehow? It seems like the -H option controls spanning to other hosts for both the images/stylesheets/javascript case and the link case, and wget doesn't allow me to separate the two.
Downloading All Dependencies in a page
First step is downloading all the resources of a particular page. If you look in the man pages for wget you will find this:
...to download a single page and all its requisites (even if they exist on
separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:
wget -E -H -k -K -p http://<site>/<document>
Getting Multiple Pages
Unfortunately, that only works per-page. You can turn on recursion with -r, but then you run into the issue of following external sites and blowing up. If you know the full list of domains that could be used for resources, you can limit it to just those using -D, but that might be hard to do. I recommend using a combination of -np (no parent directories) and -l to limit the depth of the recursion. You might start getting other sites, but it will at least be limited. If you start having issues, you could use --exclude-domains to limit the known problem causers. In the end, I think this is best:
wget -r -E -H -k -K -p -np -l 1 http://<site>/level
Limiting the domains
To help figure out what domains need to be included/excluded you could use this answer to grep a page or two (you would want to grep the .orig file) and list the links within them. From there you might be able to build a decent list of domains that should be included and limit it using the -D argument. Or you might at least find some domains that you don't want included and limit them using --exclude-domains. Finally, you can use the -Q argument to limit the amount of data downloaded as a safeguard to prevent filling up your disk.
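As a rough illustration of that grep step, the sketch below pulls the distinct hosts out of a saved page. The function name list_hosts is hypothetical, and the pattern only catches absolute http:// and https:// URLs, so protocol-relative or relative links are missed.

```shell
#!/bin/sh
# list_hosts: read HTML on stdin and print each distinct host that appears
# in an absolute http:// or https:// URL, one per line.
list_hosts() {
    grep -oE 'https?://[^/"'"'"' <>]+' | sed -E 's#^https?://##' | sort -u
}
```

For example, list_hosts < index.html.orig | paste -sd, - would turn the hosts found in a backed-up page (index.html.orig is a made-up name) into a comma-separated list of the kind -D and --exclude-domains expect.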
Descriptions of the Arguments
-E
    If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename.
-H
    Enable spanning across hosts when doing recursive retrieving.
-k
    After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
-K
    When converting a file, back up the original version with a .orig suffix.
-p
    This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
-np
    Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
-l
    Specify recursion maximum depth level depth.
-D
    Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on -H.
--exclude-domains
    Specify the domains that are not to be followed.
-Q
    Specify download quota for automatic retrievals. The value can be specified in bytes (default), kilobytes (with k suffix), or megabytes (with m suffix).
Just put wget -E -H -k -K -p -r http://<site>/ to download a complete site. Don't get nervous if while downloading you open some page and its resources are not available, because when wget finishes it all, it will convert them!
To download all "files necessary to display the page properly", you can use -p (--page-requisites), perhaps together with -Q (--quota).
Try using wget's --accept-regex flag. POSIX regex support (--regex-type posix) is compiled into wget by default, but you can compile in the Perl-compatible regex engine (PCRE) if you need something more elaborate:
E.g. The following will get all pngs on external sites one level deep and any other pages that have the word google in the url:
wget -r -H -k -l 1 --regex-type posix --accept-regex "(.*google.*|.*png)" "http://www.google.com"
It doesn't actually solve the problem of dipping down multiple levels on external sites, for that you would have to probably write your own spider. But using the --accept-regex you can probably get close to what you are looking for in most cases.
Within a single layer of a domain you can check all links internally, and on third party servers with the following command.
wget --spider -nd -e robots=off -Hprb --level=1 -o wget-log -nv http://localhost
The limitation here is that it only checks a single layer. This works well with a CMS where you can flatten the site with a GET variable rather than CMS-generated URLs. Otherwise you can use your favorite server-side script to loop this command through directories. For a full explanation of all of the options, check out this GitHub commit.
https://github.com/jonathan-smalls-cc/git-hooks/blob/LAMP/contrib/pre-commit/crawlDomain.sh
My website's file structure has gotten very messy over the years from uploading random files to test different things out. I have a list of all my files such as this:
file1.html
another.html
otherstuff.php
cool.jpg
whatsthisdo.js
hmmmm.js
Is there any way I can input my list of files via command line and search the contents of all the other files on my website and output a list of the files that aren't mentioned anywhere on my other files?
For example, if cool.jpg and hmmmm.js weren't mentioned in any of my other files then it could output them in a list like this:
cool.jpg
hmmmm.js
And then any of those other files mentioned above aren't listed because they are mentioned somewhere in another file. Note: I don't want it to just automatically delete the unused files, I'll do that manually.
Also, of course I have multiple folders so it will need to search recursively from my current location and output all the unused (unreferenced) files.
I'm thinking command line would be the fastest/easiest way, unless someone knows of another. Thanks in advance for any help that you guys can be!
Yep! This is pretty easy to do with grep. In this case, you would run a command like:
$ for orphan in $(cat orphans.txt); do \
echo "Checking for presence of ${orphan} in present directory..." ; \
grep -rlF -- "$orphan" . ; done
Here orphans.txt would look like your list of files above, one file per line. You can add -i to the grep above if you want to grep case-insensitively. Run the command in /var/www or wherever your distribution keeps its webroots, and keep orphans.txt outside that directory; otherwise grep will always match the list itself. If a "Checking for..." line is followed by no matches, then no file under that directory mentions that name.
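To get just the list of unreferenced names that the question asks for, the loop can be inverted so that it prints a name only when grep finds nothing. This is a sketch, not a tested tool: the function name unreferenced is hypothetical, the list file is assumed to live outside the directory being searched, and a file that mentions its own name still counts as referenced.

```shell
#!/bin/sh
# unreferenced LIST ROOT: print each name from LIST (one per line) that is
# not mentioned in any file under ROOT. Nothing is deleted.
# ASSUMPTION: LIST is stored outside ROOT, so it cannot match itself.
unreferenced() {
    list=$1
    root=$2
    while IFS= read -r name; do
        # Skip empty lines in the list.
        [ -n "$name" ] || continue
        # -r: recurse; -q: quiet, stop at first match; -F: literal string.
        if ! grep -rqF -- "$name" "$root"; then
            printf '%s\n' "$name"
        fi
    done < "$list"
}
```

A call such as unreferenced ~/orphans.txt /var/www would then print only the candidates for manual deletion.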
Okay so here is what I want to do. I want to add a print option that prints whatever the user's document is to a PDF and adds some headers before sending it off to a device.
I guess my questions are: how do I add a virtual "printer" driver for the user that will launch the application I've been developing that will make the PDF (or make the PDF and launch my application with references to the newly generated PDF)? How do I interface with CUPS to generate the PDF? I'm not sure I'm being clear, so let me know if more information would be helpful.
I've worked through this printing with CUPS tutorial and seem to get everything set up okay, but the file never seems to appear in the appropriate temporary location. And if anyone is looking for a user-end PDF-printer, this cups-pdf-for-mac-os-x is one that works through the installer, however I have the same issue of no file appearing in the indicated directory when I download the source and follow the instructions in the readme. If anyone can get either of these to work on a mac through the terminal, please let me know step-by-step how you did it.
The way to go is this:
Set up a print queue with any driver you like, but I recommend using a PostScript driver/PPD. (A PostScript PPD is one which does not contain any *cupsFilter: ... line.)
Initially, use the (educational) CUPS backend named 2dir. That one can be copied from this website: KDE Printing Developer Tools Wiki. Make sure when copying that you get the line endings right (Unix-like).
Commandline to set up the initial queue:
lpadmin \
-p pdfqueue \
-v 2dir:/tmp/pdfqueue \
-E \
-P /path/to/postscript-printer.ppd
The 2dir backend will now write all output to the directory /tmp/pdfqueue/ and it will use a unique name for each job. Each result should for now be a PostScript file (with none of the modifications you want yet).
Locate the PPD used by this queue in /etc/cups/ppd/ (its name should be pdfqueue.ppd).
Add the following line (best, near the top of the PPD):
*cupsFilter: "application/pdf 0 -"
(Make sure the *cupsFilter starts at the very beginning of the line.) This line tells cupsd to auto-setup a filtering chain that produces PDF and then call the last filter named '-' before it sends the file via a backend to a printer. That '-' filter is a special one: it does nothing, it is a passthrough filter.
Re-start the CUPS scheduler:
sudo launchctl unload /System/Library/LaunchDaemons/org.cups.cupsd.plist
sudo launchctl load /System/Library/LaunchDaemons/org.cups.cupsd.plist
From now on your pdfqueue will cause each job printed to it to end up as PDF in /tmp/pdfqueue/*.pdf.
Study the 2dir backend script. It's simple Bash, and reasonably well commented.
Modify the 2dir backend so that it applies your desired modifications to the PDF before saving the result in /tmp/pdfqueue/*.pdf...
Update: Looks like I forgot 2 quotes in my originally prescribed *cupsFilter: ... line above. Sorry!
I really wish I could accept two answers because I don't think I could have done this without all of @Kurt Pfeifle's help with Mac specifics and with understanding printer drivers and file locations. But here's what I did:
Download the source code from codepoet cups-pdf-for-mac-os-x. (For non-macs, you can look at http://www.cups-pdf.de/) The readme is greatly detailed and if you read all of the instructions carefully, it will work, however I had a little trouble getting all the pieces, so I will outline exactly what I did in the hopes of saving someone else some trouble. For this, the directory with the source code is called "cups-pdfdownloaddir".
Compile cups-pdf.c contained in the src folder as the readme specifies:
gcc -O9 -s -lcups -o cups-pdf cups-pdf.c
There may be a warning: ld: warning: option -s is obsolete and being ignored, but this posed no issue for me. Copy the binary into /usr/libexec/cups/backend. You will likely have to use the sudo command, which will prompt you for your password. For example:
sudo cp /cups-pdfdownloaddir/src/cups-pdf /usr/libexec/cups/backend
Also, don't forget to change the permissions on this file. It needs root-only permissions (700), which can be set with the following after moving cups-pdf into the backend directory:
sudo chmod 700 /usr/libexec/cups/backend/cups-pdf
Edit the file contained in /cups-pdfdownloaddir/extra/cups-pdf.conf. Under the "PDF Conversion Settings" header, find the line under GhostScript that reads #GhostScript /usr/bin/gs. I left it commented out in case I needed it later, and simply added beneath it the line Ghostscript /usr/bin/pstopdf. (None of the added lines should have a leading #.)
Find the line under GSCall that reads #GSCall %s -q -dCompatibilityLevel=%s -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -sOutputFile="%s" -dAutoRotatePages=/PageByPage -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode -dPDFSETTINGS=/prepress -c .setpdfwrite -f %s
Again without uncommenting this, I added beneath it the line GSCall %s %s -o %s %s
Find the line under PDFVer that reads #PDFVer 1.4 and change it to PDFVer, no spaces or following characters.
Now save and exit editing before copying this file to /etc/cups with the following command
sudo cp /cups-pdfdownloaddir/extra/cups-pdf.conf /etc/cups
Be careful of editing in a text editor because newlines in UNIX and Mac environments are different and can potentially ruin scripts. You can always use a perl command to remove them, but I'm paranoid and prefer not to deal with it in the first place.
You should now be able to open a program (e.g. Word, Excel, ...) and select File >> Print and find an available printer called CUPS-PDF. Print to this printer, and you should find your pdfs in /var/spool/cups-pdf/yourusername/ by default.
*Also, I figured this might be helpful because it helped me: if something gets screwed up in following these directions and you need to start over/get rid of it, in order to remove the driver you need to (1) remove the cups-pdf backend from /usr/libexec/cups/backend (2) remove the cups-pdf.conf from /etc/cups/ (3) Go into System Preferences >> Print & Fax and delete the CUPS-PDF printer.
This is how I successfully set up a pdf backend/filter for myself, however there are more details, and other information on customization contained in the readme file. Hope this helps someone else!
I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.
This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.
Easiest way to do this?
Thanks!
Do you really need to compress the file?
wget provides -N (--timestamping), which turns on time-stamping. For example, say your file is located at www.example.com/file.txt
The first time you do:
$ wget -N www.example.com/file.txt
[...]
[...] file.txt saved [..size..]
The next time it'll be like this:
$ wget -N www.example.com/file.txt
Server file no newer than local file “file.txt” -- not retrieving.
Except if the file on the server was updated.
That would solve your problem, if you didn't compress the file.
If you really need to compress it, then I guess I'd go with comparing the hash of the new file/archive and the old. What matters in that case is: how big is the downloaded file? Is it worth compressing it first and then checking the hashes? Is it worth decompressing the old archive and comparing the hashes? Is it better to store the old hash in a text file? Do any of these have an advantage over simply overwriting the old file?
Only you can know that, so run some tests.
So if you go the hash way, consider sha256 and xz (lzma2 algorithm) compression.
I would do something like this (in Bash):
newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
# on the first run there is no archive yet, so this is the sum of empty input
oldfilesum="$(xzcat file.txt.xz 2>/dev/null | sha256sum)"
if [[ "$newfilesum" != "$oldfilesum" ]]; then
    xz -f file.txt # overwrite with the new compressed data
else
    rm file.txt
fi
And that's it.
Calculate a hash of the content of the file and check against the new one. Use for instance md5sum. You only have to save the last MD5 sum to check if the file changed.
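The "save only the last sum" idea might be sketched like this. It uses sha256sum rather than md5sum, to match the earlier answer; on macOS, shasum -a 256 would be the likely substitute. The function name content_changed and the file name last.sum are hypothetical.

```shell
#!/bin/sh
# content_changed FILE SUMFILE: succeed (exit 0) if FILE's SHA-256 differs
# from the hash stored in SUMFILE, and store the new hash; fail (exit 1)
# if the hash is unchanged. A missing SUMFILE counts as "changed".
content_changed() {
    new=$(sha256sum < "$1" | cut -d' ' -f1)
    # No stored hash yet (first run) simply yields an empty string.
    old=$(cat "$2" 2>/dev/null || true)
    if [ "$new" = "$old" ]; then
        return 1
    fi
    printf '%s\n' "$new" > "$2"
    return 0
}
```

In the cron script, the download would be followed by something like: if content_changed file.txt last.sum; then add the timestamp and compress; else rm file.txt; fi. Only the small hash file has to be kept between runs, so the old archive never needs to be decompressed.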
Also, take into account that the web is evolving to give more information about pages, that is, metadata. A well-built web site should include the file version and/or date of modification (or a valid Expires header) as part of the response headers. This, among other things, is what makes Web 2.0 scalable.
How about downloading the file, and checking it against a "last saved" file?
For example, the first time it downloads myfile, and saves it as myfile-[date], and compresses it. It also adds a symbolic link, such as lastfile pointing to myfile-[date]. The next time the script runs, it can check if the contents of whatever lastfile points to is the same as the new downloaded file.
Don't know if this would work well, but it's what I could think of.
You can compare the new file with the last one using the sum command. This takes the checksum of the file. If both files have the same checksum, they are very, very likely to be exactly the same. There's another command called md5 that takes the md5 fingerprint, but the sum command is on all systems.