Process 100K of image files with bash

Here is the script to optimize JPG images: https://github.com/kormoc/imgopt/blob/master/imgopt
There is a CMS with image files (not mine).
I assume there is a complicated structure of subdirectories, and the script just recursively finds all image files in a given folder.
The question is how to mark already-processed files so that on the next run the script skips them and doesn't touch them again.
I don't know when the guys will add new files that need processing. I also don't think renaming the files is a good choice.
I was thinking about a hash table or associative array that would be filled from a text file at startup.
But is it OK to have a 100K-item array in bash? It seems complicated for a script.
Any other ideas about optimization are also welcome.
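For reference, a minimal sketch of the associative-array idea I have in mind (processed.txt, /path/to/cms and the imgopt invocation are placeholder assumptions):

#!/bin/bash
declare -A processed

# Load previously processed paths (one per line) into the associative array.
if [[ -f processed.txt ]]; then
    while IFS= read -r path; do
        processed["$path"]=1
    done < processed.txt
fi

find /path/to/cms -type f -iname '*.jpg' -print0 |
while IFS= read -r -d '' img; do
    [[ ${processed["$img"]+set} ]] && continue   # seen on an earlier run, skip
    ./imgopt "$img"                               # optimize the image
    echo "$img" >> processed.txt                  # remember it for the next run
done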

I think the easiest thing to do is just to output a marker file with a similar name for each processed image.
For example, after image1.jpg has been processed, you would create an empty file with a similar name, e.g. .image1.jpg.processed.
Then when your script runs, for the current image NAME.EXT it just checks whether a file .NAME.EXT.processed exists. If it doesn't, you know the image still needs to be processed. No memory issues and no hash table needed, granted you will end up with 100K extra empty files.
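A minimal sketch of that marker-file approach, assuming bash, GNU find, and that imgopt takes the image path as its only argument (the CMS path is a placeholder):

#!/bin/bash
# Process only images that don't yet have a ".NAME.EXT.processed" marker next to them.
find /path/to/cms -type f -iname '*.jpg' -print0 |
while IFS= read -r -d '' img; do
    marker="$(dirname "$img")/.$(basename "$img").processed"
    [[ -e $marker ]] && continue   # already handled on a previous run
    ./imgopt "$img"                # optimize the image (placeholder invocation)
    touch "$marker"                # leave an empty marker file behind
done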

Related

How to replace all files with a specific pixel size, and keep their names?

I have a .png file and a folder with multiple folders in it, which all contain multiple .png files, e.g. '1.png', '2.png' and so on. What I'd like to do is replace the contents of all those files with the contents of the first-named file, but keep their names.
I'm pretty sure this is doable, and I think I found an answer for doing it in one folder using Command Prompt on Windows, but I'd also like to replace only files that have a specific size in pixels (not the same size as the file I'm replacing them with, just a specific size).
I would prefer to do this simply with a batch file, so if anyone can help, that would be appreciated a lot. If the file-size criterion can't be met with a batch file, but it is possible to do this for a whole directory tree of .png files, that gets me a long way as well.
Thank you in advance!
Edit: In the comments Stephan mentioned that you can't get the pixel size of a file from a batch script, so it turns out this isn't possible. I'm not going to bother with an external application.
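For completeness: the pixel-size check does require an external tool. If a Unix-like shell and ImageMagick were acceptable (both assumptions, and the asker chose not to use an external application), a sketch might look like this, with the paths and the 64x64 size as placeholders:

#!/bin/bash
src="source.png"      # the reference image whose contents will be copied
want="64x64"          # only overwrite images with exactly these pixel dimensions

find /path/to/folders -type f -name '*.png' -print0 |
while IFS= read -r -d '' png; do
    [[ $png -ef $src ]] && continue           # skip the reference file itself
    size=$(identify -format '%wx%h' "$png")   # pixel dimensions via ImageMagick
    [[ $size == "$want" ]] || continue
    cp -- "$src" "$png"                       # replace contents, keep the name
done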

Search an image in a directory

I have a project that contains lots of different images. Once in a while we add more images to it, but first we need to check whether the image already exists (because we added it previously).
We have been doing this manually, looking for the image in the folders, but as the project has grown it has become pretty time-consuming.
So, I would like to create a script that, given an image, looks in a directory to check whether it already exists.
Do you know if there is any command line based tool or something I can use to build a script to do this?
There is the fdupes utility, which does byte-by-byte comparison. It has a -d or --delete option that will prompt you to choose which files to keep when it finds duplicates. If you don't care about the filename, you can tell it to keep only the first one in each set without prompting:
fdupes --delete --noprompt
If you want to delete images that look the same but are slightly different, it's an image recognition problem which I guess does not have such a straightforward solution.
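For the exact-duplicate case, a rough checksum-based check in bash (the project path, script name and extensions are placeholders):

#!/bin/bash
# Usage: ./check_image.sh candidate.png
# Reports any byte-identical copies already present under the project directory.
new="$1"
newsum=$(sha256sum "$new" | cut -d' ' -f1)

found=0
while IFS= read -r -d '' f; do
    [[ $(sha256sum "$f" | cut -d' ' -f1) == "$newsum" ]] && { echo "Already present: $f"; found=1; }
done < <(find /path/to/project -type f \( -iname '*.png' -o -iname '*.jpg' \) -print0)

(( found )) || echo "Not found; safe to add."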

Handle single files while extracting tar.gz

I have a huge .tgz file which is structured inside like this:
./RandomFoldername1/file1
./RandomFoldername1/file2
./RandomFoldername2/file1
./RandomFoldername2/file2
etc
What I want to do is have each individual file extracted to standard output so that I can pipe it to another command afterwards. While doing this, I also need to get the RandomFoldername and file name so that I can deal with them properly from within the second command.
So far the options I have are:
extract the entire tarball and deal with the resulting directory structure, which is not an option since the extracted contents don't fit on the hard drive, or
loop over the archive listing, pattern-match each file, and extract one file at a time. This option solves the problem, but it is too slow because the whole tarball is swept each time for a single file.
While searching on how to solve this, I've started to fear that there is no better alternative to this.
Using the tar tool alone, I don't believe you have any other options.
Using a tar library in a language of your choice should allow you to do what you want, though, as it lets you iterate over the entries in the tarball one by one and extract/pipe each file as necessary.
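One single-pass possibility worth mentioning: GNU tar can hand each extracted member to a command on standard input via --to-command, with the member's path available in the TAR_FILENAME environment variable (GNU tar specific; the archive name and handler script below are assumptions):

tar -xzf huge.tgz --to-command='./handle_member.sh'

where handle_member.sh (hypothetical) receives each member's bytes on stdin:

#!/bin/bash
# Split TAR_FILENAME (e.g. ./RandomFoldername1/file1) into folder and file name,
# then feed the streamed contents to the real processing command (placeholder name).
folder=$(dirname "$TAR_FILENAME")
name=$(basename "$TAR_FILENAME")
second-command --folder "$folder" --file "$name"   # reads the member from stdin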

ruby - get a file from directory without listing all contents

I'm using the Linux split command to split huge XML files into node-sized ones. The problem is that now I have a directory with hundreds of thousands of files.
I want a way to get a file from the directory (to pass to another process for import into our database) without needing to list everything in it. Is this how Dir.foreach already works? Any other ideas?
You can use Dir.glob to find the files you need. More details here, but basically, you pass it a pattern like Dir.glob 'dir/*.rb' and get back filenames matching that pattern. I assume it's done in a reasonably good way, but it will depend on your platform and implementation.
As to Dir.foreach, this should be efficient too - the concern would be if it had to process the entire directory on every pass around the loop. But that would be an awful implementation, and it is not the case.

Utility to hash and list files with identical contents?

UltraEdit saves temporary, i.e. unsaved/untitled, files as (regex) "Edit.\d+".
When UltraEdit is killed (I do this when some software nags me to reboot), I've noticed that it doesn't always save those files in the same directory, so I end up with a bunch of "Edit.\d+" files scattered across my two hard disks, many with identical contents.
So I'd like a free utility for Windows that can...
search my hard-disks for all files whose filename matches "Edit.\d+"
generate a hash of each file so it has some signature, and
output a list of all identical files so that I don't waste time checking files that exist in multiple copies on my hard-disk, and just take care of unique files.
Anyone knows of such a thing?
Thank you.
Found this: http://www.atory.com/Dupe_Checker/
I can't give you a review, but it looks legit.
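Not Windows-native, but for reference: if a Unix-like environment (Cygwin, Git Bash or WSL) happens to be available, GNU find plus md5sum can produce the grouped list directly (the search roots are placeholders):

find /mnt/c /mnt/d -type f -regextype posix-extended -regex '.*/Edit\.[0-9]+' -print0 |
    xargs -0 -r md5sum |
    sort |
    uniq -w32 --all-repeated=separate   # group files whose MD5 signatures match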
