I have a batch processing script in GIMP that, for every image file in a directory, involves importing the image, importing a background image as a layer, doing some edits, and exporting the image. The edits take no time at all but the gimp-file-load, gimp-file-load-layer, and gimp-file-save steps take a combined total of 3-4 seconds for a 69x96 .tga image and so the batch process will take the better part of a day to handle thousands of files.
Is there a faster way to import/export these images in GIMP? Maybe I can eliminate the background import step by keeping the background image open until the batch process is complete. But then what would I use in place of
(gimp-file-load-layer 1 image background)
to add the background image as a layer? I don't know of any procedures that can transfer data between two images, open in GIMP or not, without using the clipboard (which I'm already using to transfer alpha channel data) or file-load.
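One possibility for that part, untested here: the PDB procedure gimp-layer-new-from-drawable copies a drawable from one open image into another without touching disk or the clipboard. A minimal Python-Fu sketch, where background_image is assumed to have been loaded once before the batch loop:
# background_image stays open for the whole batch; image is the current file
bg_drawable = pdb.gimp_image_get_active_drawable(background_image)
layer = pdb.gimp_layer_new_from_drawable(bg_drawable, image)
pdb.gimp_image_add_layer(image, layer, 0)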
Not really an answer but too long for a comment:
Using two 200x200 TGA files (filled with plasma):
# Timing harness, run in the GIMP Python-Fu console (GIMP 2.x, Python 2; pdb is pre-loaded there)
import time
times = []
times.append(time.time())
# Load the base image from disk
image = pdb.gimp_file_load("/tmp/TGA/TGA-200x200-in.tga", "/tmp/TGA/TGA-200x200-in.tga")
times.append(time.time())
# Load the second file directly as a layer of that image
layer = pdb.gimp_file_load_layer(image, "/tmp/TGA/TGA-200x200-in2.tga")
times.append(time.time())
# Put the loaded layer on top of the layer stack
pdb.gimp_image_add_layer(image, layer, 0)
times.append(time.time())
# Flatten so there is a single drawable to export
layerOut = pdb.gimp_image_flatten(image)
times.append(time.time())
# Export as TGA (the trailing 1, 0 are the plug-in's RLE and origin flags)
pdb.file_tga_save(image, layerOut, "/tmp/TGA/TGA-200x200-out.tga", "/tmp/TGA/TGA-200x200-out.tga", 1, 0)
times.append(time.time())
# Per-step and total timings
print "Steps:", ["%5.1fms" % ((e - s) * 1000) for s, e in zip(times[:-1], times[1:])]
print "Total: %5.1fms" % ((times[-1] - times[0]) * 1000)
Yields:
Steps: [' 97.7ms', '106.3ms', ' 20.6ms', ' 22.2ms', '102.6ms']
Total: 349.4ms
So this is about 10 times faster for me. I tried variations (using file-save instead of file-tga-save, for instance) without any significant change in running time.
Yes, this is Python, but AFAIK this ends up running the same code in GIMP (otherwise you have a solution...). So IMHO you have an I/O bottleneck.
Measurements were taken on a Core i5-9400H @ 2.50GHz with an SSD, running Linux with an ext4 file system (which could be another solution...).
Related
I have a folder with about 750'000 images. Some images will change over time and new images will also be added every now and then. The folder structure is about 4-5 levels deep, with a maximum of 70'000 images in a single folder.
I now want to write a script that can do the following:
Loop through all the files
Check if the file is new (has not yet been converted) or changed since the last conversion
Convert the file from jpg or png to webp if the above rules apply
My current solution is a Python script that writes the conversion times into an SQLite database. It works, but is really slow. I also thought about doing it in PowerShell due to better performance (I assume), but I had no efficient way of storing the conversion times.
What language would you recommend? Is there another way to convert jpg to webp without having to externally call the cwebp command from within my script?
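A minimal sketch of the incremental idea in Python, assuming Pillow is installed with WebP support; instead of a database it compares the source file's mtime against the existing .webp (the root path and quality setting are placeholders):
from pathlib import Path
from PIL import Image  # Pillow, assumed to be built with WebP support

SRC_ROOT = Path(r"D:\images")  # hypothetical root of the photo tree

def convert_if_needed(src):
    dst = src.with_suffix(".webp")
    # Skip files whose .webp already exists and is newer than the source
    if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
        return
    with Image.open(src) as im:
        im.save(dst, "WEBP", quality=80)

for path in SRC_ROOT.rglob("*"):
    if path.suffix.lower() in (".jpg", ".jpeg", ".png"):
        convert_if_needed(path)
This sidesteps the external cwebp call, though how much it helps depends on whether walking the tree or the conversion itself dominates.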
I am looking to convert 15 million 12.8 MB Sony .ARW files to .jpg.
I have figured out how to do it using sips on the command line, BUT what I need is to make adjustments to the raw image settings: Contrast, Highlights, Blacks, Saturation, Vibrance, and most importantly Dehaze. I would be applying the same settings to every single photo.
It seems like ImageMagick should work if I can make adjustments for how to incorporate Dehaze but I can't seem to get ImageMagick to work.
I have done benchmark testing comparing Lightroom Classic / Photoshop / Bridge / RAW Power / and a few other programs. RAW Power is the fastest by far (on an M1 Mac Mini with 16GB RAM), but it doesn't allow me to process multiple folders at once.
I do a lot of scripting / actions with Photoshop - but in this case Photoshop is by far the slowest option. I believe this is because it opens each photo.
That's 200TB of input images, without even allowing any storage space for output images. It's also 173 solid days of 24 hr/day processing, assuming you can do 1 image per second - which I doubt.
You may want to speak to Fred Weinhaus #fmw42 about his Retinex script (search for "hazy" on that page), which does a rather wonderful job of haze removal. Your project sounds distinctly commercial.
© Fred Weinhaus - Fred's ImageMagick scripts
If/when you get a script that does what you want, I would suggest using GNU Parallel to get decent performance. I would also think you may want to consider porting, or having ported, Fred's algorithm to C++ or Python to run with OpenCV rather than ImageMagick.
So, say you have a 24-core MacPro, and a bash script called ProcessOne that takes the name of a Sony ARW image as parameter, you could run:
find . -iname \*.arw -print0 | parallel --progress -0 ProcessOne {}
and that will recurse in the current directory finding all Sony ARW files and passing them into GNU Parallel, which will then keep all 24-cores busy until the whole lot are done. You can specify fewer, or more jobs in parallel with, say, parallel -j 8 ...
Note 1: You could also list the names of additional servers in your network and it will spread the load across them too. GNU Parallel is capable of transferring the images to remote servers along with the jobs, but I'd have to question whether it makes sense to do that for this task - you'd probably want to put a subset of the images on each server with its own local disk I/O and run the servers independently yourself rather than distributing from a single point globally.
Note 2: You will want your disks well configured to handle multiple, parallel I/O streams.
Note 3: If you do write a script to process an image, write it so that it accepts multiple filenames as parameters, then you can run parallel -X and it will pass as many filenames as your sysctl parameter kern.argmax allows. That way you won't need a whole bash or OpenCV C/C++ process per image.
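To illustrate Note 3, here is a sketch of a worker that accepts many filenames per invocation, suitable for parallel -X. Using ImageMagick's convert for the ARW-to-JPEG step is an assumption (it needs a raw delegate such as dcraw/ufraw installed), and the adjustment flags are left as a placeholder:
#!/usr/bin/env python3
# Hypothetical worker: find . -iname '*.arw' -print0 | parallel -0 -X ./process_batch.py
import subprocess
import sys
from pathlib import Path

def process_one(arw):
    jpg = arw.with_suffix(".jpg")
    # Contrast/dehaze adjustments (or a call to Fred's script) would go here
    subprocess.run(["convert", str(arw), "-quality", "92", str(jpg)], check=True)

if __name__ == "__main__":
    # parallel -X passes as many filenames as the argument limit allows,
    # so one interpreter start-up covers a whole batch of images
    for name in sys.argv[1:]:
        process_one(Path(name))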
I'm generating animated GIF files from multiple source images using Ruby. I need to maximize throughput / minimize time spent to create each GIF. I'd prefer to keep the source images in memory (probably Memcached) rather than read them from disc every time I need them. I've been using convert in backticks to execute imagemagick commands directly from Ruby, e.g.
`convert -delay #{delay} -page #{w}x#{h}+0+0 src01.gif... etc`
I slightly prefer this over RMagick as I've found more examples and I can reference the ImageMagick docs directly. It seems that images passed to the convert command need to be paths to images on disc. Additionally, it seems like the output of the convert command is a file path, so the generated image would be written to disc by ImageMagick and I'd need to read it back off disc using Ruby to access the resulting image data. It seems like I'm making ImageMagick read the source images from disc each time and write the generated GIF to disc each time. I think this is likely to be a bottleneck, and unnecessary, as I don't need to persist the generated images - I just need to access their image data in Ruby momentarily.
I noticed that RMagick methods can take Magick::Images as parameters instead of filepaths. I could keep the source images in memory in this case. Additionally RMagick returns the generated image as data to Ruby which is what I need, I don't need it written to disc.
I'm thinking of using RMagick instead of
`convert...`
to reduce disc activity.
So question 1: Does this make sense though? Since RMagick presumably wraps ImageMagick, is RMagick actually reading and writing to disc under the hood or does it have some way of utilizing ImageMagick without disc activity?
And question 2: Is there any way to get image data in and out of ImageMagick's convert command without disc activity?
Hope this makes sense. Just trying to wrap my head around this and apologize if I'm unclear.
Does this make sense though?
Not really. We can argue about open fd's, and the cost of shell environments over a direct API, but there wouldn't be any disk I/O benefit between the convert utility and RMagick.
Is there any way to get image data in and out of ImageMagick's convert command without disc activity?
ImageMagick ships with a stream utility. There's not much usage documentation, but it could be leveraged to extract the image data to a blob that can be distributed via memcached.
There's also the mpr: protocol to handle label-based memory access, but that might not be the distributed solution you're looking for. Plus, the data is removed when the process completes.
Personally, Mark's comment about a RAMdisk would be something I would recommend. A simple memory/tmpfs mount is easy to set up on a system, and then it would just be a matter of updating the policy.xml configuration to use said mount as a temporary directory.
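For question 2 specifically, convert can also read from stdin and write to stdout using the - and format:- pseudo-filenames, which keeps that particular step off the disk. A minimal sketch of the piping (shown with Python's subprocess purely to illustrate; the equivalent works when spawning convert from Ruby, and whether it extends cleanly to assembling a multi-frame GIF is a separate question):
import subprocess

src_bytes = open("/tmp/src01.gif", "rb").read()   # stand-in for a memcached lookup

result = subprocess.run(
    ["convert", "-", "-resize", "50%", "gif:-"],  # '-' = stdin, 'gif:-' = stdout
    input=src_bytes,
    stdout=subprocess.PIPE,
    check=True,
)
gif_bytes = result.stdout  # result stays in memory, never written to disk here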
I have a service which produces pdf files. I have PS-compatible printers. To print the pdf files, I use ghostscript to convert them to ps and copy them to a shared (Windows) print queue. Most of the pdf files contain just a few pages (<10) and don't cause any trouble.
From time to time I have to print large files (100+, 500+, 5000+) pages and there I observe the following:
converting to ps is fast for the first couple of pages, then slows down. The further the progress, the longer the time for a single page.
after conversion, copying to the print queue works without problems
when copying is finished and it comes to sending the document to the printer, I observe more or less the same phenomenon: the further the progress, the slower the transfer.
Here is how I convert pdf to ps:
"C:\Program Files\gs\gs9.07\bin\gswin64c.exe" \
-dSAFER -dNOPAUSE -DBATCH \
-sOutputFile=D:\temp\testGS\test.ps \
-sDEVICE=ps2write \
D:\temp\testGS\test.pdf
After this conversion I simply copy it to the print queue
copy /B test.ps \\printserever\myPSQueue
What possibilities do I have to print large files this way?
My first idea was to do the following:
"C:\Program Files\gs\gs9.07\bin\gswin64c.exe" \
-dSAFER -dNOPAUSE -DBATCH \
-sOutputFile=D:\temp\testGS\test%05d.ps \
-sDEVICE=ps2write \
D:\temp\testGS\test.pdf
Working with single pages speeds up the conversion (it doesn't slow down after every single page), and printing is also fast when I copy every single page to the printer as its own ps file. But there is one problem I will encounter sooner or later: when I copy the single ps files, they will be separate print jobs. Even when they are sorted in the correct order, if someone else starts a print job on the same printer in between, the printouts will all get mixed up.
The other idea was using gsPrint, which works considerably faster, but with gsPrint I need the printer to be installed locally, which is not manageable in my environment with 300+ printers at different locations.
Can anyone tell me exactly what happens? Is this a bad way to print? Does anyone have a suggestion for how to solve the task of printing such documents in such an environment?
Without seeing an example PDF file it's difficult to say much about why it should print slowly. However, the most likely explanation is that the PDF is being rendered to an image, probably because it contains transparency.
This will result in a large image, created at the default resolution of the device (720 dpi), which is almost certainly higher than required for your printer(s). This means that a large amount of time is spent transmitting extra data to the printer, which the PostScript interpreter in the printer then has to discard.
Using gsprint renders the file to the resolution of the device, assuming this is less than 720 dpi the resulting PostScript will be smaller therefore requiring less time to transmit, less time to decompress on the printer and less time spent throwing away extra data.
One reason the speed decreases is because of the way ps2write works: it maintains much of the final content in temporary files, and stitches the main file back together from those files. It also maintains a cross-reference table which grows as the number of objects in the file does. Unless you need the files to be continuous, you could create a number of print files by using the -dFirstPage and -dLastPage options so that only a subset of the final printout is created; this might improve the performance.
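A rough sketch of that chunking idea, driving Ghostscript from a small Python loop; the chunk size and total page count are placeholders, and the gswin64c path is taken from the question:
import subprocess

GS = r"C:\Program Files\gs\gs9.07\bin\gswin64c.exe"
SRC = r"D:\temp\testGS\test.pdf"
TOTAL_PAGES = 5000   # assumed to be known (or determined beforehand)
CHUNK = 100          # pages per output ps file

for start in range(1, TOTAL_PAGES + 1, CHUNK):
    end = min(start + CHUNK - 1, TOTAL_PAGES)
    out = r"D:\temp\testGS\test_%05d-%05d.ps" % (start, end)
    subprocess.run([
        GS, "-dSAFER", "-dNOPAUSE", "-dBATCH",
        "-dFirstPage=%d" % start, "-dLastPage=%d" % end,
        "-sDEVICE=ps2write",
        "-sOutputFile=" + out,
        SRC,
    ], check=True)
# Each chunk is still its own print job when copied to the queue,
# but far fewer jobs than one file per page.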
Note that ps2write does not render the incoming file to an image, while gsprint definitely does; the PostScript emerging from gsprint will simply define a big bitmap. This doesn't maintain colours (everything goes to RGB) and doesn't maintain vector objects as vectors, so it doesn't scale well. However.... If you want to use gsprint to print to a remote printer, you can set up a 'virtual printer' using RedMon. You can have RedMon send the output from a port to a totally different printer, even a remote one. So you use gsprint to print to (e.g.) 'local instance of MyPrinter' on RedMon1: and have the RedMon port set up to capture the print stream to disk and then send the PostScript file to 'MyPrinter on another PC'. Though I'd guess that's probably not going to be any faster.
My suggestion would be to set the resolution of ps2write lower; -r300 should be enough for any printer, and lower may be possible. The resolution will only affect rendered output, everything else remains as vectors and so scales nicely. Rendered images will print perfectly well at half the resolution of the printer, in general.
I can't say why the printer becomes so slow with the Ghostscript generated PostScript, but you might want to give other converters a try, like pdftops from the Poppler utils (I found a Windows download here as you seem to be using Windows).
I have approximately 600GB of photos collected over 13 years, now stored on a FreeBSD ZFS server.
The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo manipulation programs (iPhoto, Picasa, HP and many others :( ), in several deep subdirectories - in short: a TERRIBLE MESS with many duplicates.
So, first I did the following:
searched the tree for files of the same size (fast) and made MD5 checksums for those
collected the duplicated images (same size + same MD5 = duplicate)
This helped a lot, but there are still MANY MANY duplicates:
photos that differ only in the EXIF/IPTC data added by some photo management software, but the image is the same (or at least "looks the same" and has the same dimensions)
or they are only resized versions of the original image
or they are "enhanced" versions of the originals, etc.
Now the questions:
How can I find duplicates by checksumming only the "pure image bytes" in a JPG, without the EXIF/IPTC and similar meta information? I want to filter out the photo duplicates that differ only in EXIF tags but where the image itself is the same (so file checksumming doesn't work, but image checksumming could...). This is (I hope) not very complicated - but I need some direction.
What Perl module can extract the "pure" image data from a JPG file in a form usable for comparison/checksumming?
More complex
how to find "similar" images, what are only the
resized versions of the originals
"enchanced" versions of the originals (from some photo manipulation programs)
is here already any algorithm available in a unix command form or perl module (XS?) what i can use to detect these special "duplicates"?
I'm able to write complex scripts in BASH and "+-" :) know Perl. I can use FreeBSD/Linux utilities directly on the server, and over the network I can use OS X (but working with 600GB over the LAN is not the fastest way)...
My rough idea:
delete images only at the end of the workflow
use an Image::ExifTool script to collect candidate duplicates based on the image-creation date and camera model (maybe other EXIF data too)
make a checksum of the pure image data (or extract a histogram - identical images should have the same histogram) - not sure about this
use some similarity detection to find duplicates based on resizing and photo enhancement - no idea how to do this...
Any idea, help, or (software/algorithm) hint on how to bring order to the chaos?
PS:
Here is a nearly identical question: Finding Duplicate image files, but I'm already done with that answer (md5) and am looking for more precise checksumming and image comparison algorithms.
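For the "pure image bytes" checksum, one sketch of the idea (in Python with Pillow rather than a Perl module, purely as an illustration) is to hash the decoded pixel data, which ignores EXIF/IPTC entirely - though it still treats a re-encoded or resized copy as different:
import hashlib
from pathlib import Path
from PIL import Image  # Pillow, assumed available

def pixel_md5(path):
    # MD5 over the decoded pixels only; metadata changes do not affect it
    with Image.open(path) as im:
        return hashlib.md5(im.convert("RGB").tobytes()).hexdigest()

groups = {}
for p in Path("/photos").rglob("*.jpg"):   # hypothetical photo root
    groups.setdefault(pixel_md5(p), []).append(p)
duplicates = {h: ps for h, ps in groups.items() if len(ps) > 1}
A Perl equivalent could decode with Image::Magick and feed the raw pixel data to Digest::MD5.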
Assuming you can work with a locally mounted FS:
rmlint : fastest tool I've ever used to find exact duplicates
findimagedupes : automates the whole ImageMagick approach (similar, it seems, to Randal Schwartz's script, which I haven't tested)
Detecting Similar and Identical Images Using Perceptual Hashes goes all the way (a great reference post)
dupeguru-pe (gui) : dedicated tool that is fast and does an excellent job
geeqie (gui) : I find it fast/excellent for finishing the job, using the granular deduplication options. You can also then generate an ordered collection of images such that 'similar' images are next to each other, allowing you to 'flip' between the two to see the changes.
Have you looked at this article by Randal Schwartz? He uses a Perl script with ImageMagick to reduce each picture to a resized 4x4 RGB grid, which he then compares in order to flag "similar" pictures.
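Roughly, that idea looks like this (sketched with Pillow instead of Perl/ImageMagick; the 4x4 size and the difference threshold are arbitrary placeholders):
from PIL import Image

def tiny_signature(path, size=(4, 4)):
    # Shrink the image to a 4x4 RGB grid, as in the article's approach
    with Image.open(path) as im:
        return list(im.convert("RGB").resize(size).getdata())

def looks_similar(sig_a, sig_b, threshold=20):
    # Mean per-channel difference below the threshold counts as "similar"
    diffs = [abs(a - b) for pa, pb in zip(sig_a, sig_b) for a, b in zip(pa, pb)]
    return sum(diffs) / len(diffs) < threshold

print(looks_similar(tiny_signature("a.jpg"), tiny_signature("a_resized.jpg")))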
You can remove EXIF data with mogrify -strip from the ImageMagick toolset. So you could, for each image, make a copy, strip the EXIF data from the copy, md5sum it, and then compare the md5sums.
When it comes to visually similar images, you can, for example, use compare (also from the ImageMagick toolset) to produce a black/white diff map, as described here, then make a histogram of the difference and check whether there is "enough" white to mean that it's different.
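A rough sketch of that diff-and-histogram check using Pillow's ImageChops instead of the compare utility; the size normalisation, the 32-level cut-off, and the 1% threshold are all assumptions to tune:
from PIL import Image, ImageChops

def probably_different(path_a, path_b, threshold=0.01):
    with Image.open(path_a) as a, Image.open(path_b) as b:
        a = a.convert("L")
        b = b.convert("L").resize(a.size)   # normalise size before diffing
        diff = ImageChops.difference(a, b)
        hist = diff.histogram()             # 256 bins of grey-level counts
        changed = sum(hist[32:])            # pixels differing by 32+ levels
        return changed / float(sum(hist)) > threshold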
I had a similar dilemma - several hundred gigs of photos and videos spread and duplicated over about a dozen drives. I know this may not be exactly the approach you are looking for, but the FSlint Janitor application (on Ubuntu 16.x, then 18.x) was a lifesaver for me. I took the project in chunks, eventually cleaned it all up, and ended up with three complete sets (I wanted two off-site backups).
FSLint Janitor:
sudo apt install fslint