Utility to hash and list files with identical contents?

UltraEdit saves temporary (i.e. unsaved/untitled) files under names matching the regex "Edit.\d+".
When UltraEdit is killed (I do this when some software nags me to reboot), I noticed that it doesn't always save those files in the same directory, so I end up with a bunch of "Edit.\d+" files scattered across my two hard disks, many of them with identical contents.
So I'd like a free utility for Windows that can:
search my hard disks for all files whose filename matches "Edit.\d+",
compute a hash of each file so that it has a signature, and
output a list of all identical files, so that I don't waste time checking files that exist in multiple copies and only have to deal with the unique ones.
Does anyone know of such a thing?
Thank you.

I found this: http://www.atory.com/Dupe_Checker/
I can't give you a review, but it looks legit.
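
If no ready-made tool fits, the three steps in the question (find files by name, hash them, group identical ones) can be sketched in a few lines of Python. This is only an illustration, not a tested utility: the function names, the choice of SHA-256, and the assumption that the file name is exactly "Edit." followed by digits are mine.

```python
import hashlib
import os
import re
from collections import defaultdict

# Assumption: a temp file name is exactly "Edit." followed by one or more digits.
NAME_PATTERN = re.compile(r"Edit\.\d+")


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_content(roots):
    """Map each content hash to the sorted list of matching file paths."""
    groups = defaultdict(list)
    for root in roots:
        for folder, _dirs, files in os.walk(root):
            for name in files:
                if not NAME_PATTERN.fullmatch(name):
                    continue
                path = os.path.join(folder, name)
                try:
                    groups[sha256_of(path)].append(path)
                except OSError:
                    continue  # unreadable file: skip it
    return {digest: sorted(paths) for digest, paths in groups.items()}
```

Calling `group_by_content(["C:\\", "D:\\"])` would return one entry per distinct content. Entries with more than one path are sets of identical copies, and entries with a single path are the unique files that still need a look.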

Related

How to replace all files with a specific pixel size, and keep their names?

I have a .png file and a folder containing multiple subfolders, each of which contains multiple .png files, e.g. '1.png', '2.png' and so on. What I'd like to do is replace the contents of all those files with the contents of the first-mentioned file, but keep their names.
I'm pretty sure this is doable, and I think I found an answer that does this for a single folder using the command prompt on Windows. However, I'd also like to replace only files that have specific pixel dimensions (not the same dimensions as the replacement file, just one specific size).
I'd prefer to do this with a simple batch file, so any help would be appreciated a lot. If the size criterion can't be met with a batch file, but the replacement can be done for a whole directory tree of .png files, that would already get me a long way.
Thank you in advance!
Edit: In the comments, Stephan mentioned that you can't get the pixel size of a file in a batch file, so it turns out this isn't possible. I'm not going to bother with an external application.

Rules for file extensions?

Are there any rules for file extensions? For example, I wrote some code which reads and writes a byte pattern that is only understood by that specific program. I'm assuming my antivirus program won't be too happy if I give the file the name "pleasetrustme.exe"... Is it generally allowed to use those extensions? And what about the lesser-known ones, like ".arw"?

You can use any file extension you want (or none at all). Using standard extensions that reflect the actual type of the file just makes things more convenient. On Windows, file extensions control things like how files are displayed in Windows Explorer and what happens when you double-click on them.

"I wrote some code which reads and writes a byte pattern that is only understood by that specific program."
A file extension is only an indication of what type of data will be inside, never a guarantee that data formatted in a specific way will be inside the file.
For your own data structure it is of course best to choose an extension that is not already in use for other file formats (or maybe a general extension like .dat or .bin). This also has the advantage that you can use your own icon without it being overwritten by other software that uses the same extension, or the other way around.
But maybe even more important when creating a custom (binary?) file format is to provide a magic number as the first bytes of the file, possibly followed by a file header structure containing a version number etc. That way your own software can first check the header data to make sure the file is of the right type and version. Anyone could rename a file of any type to your extension, so your program needs a way to check the inside of the file before reading the remaining data.
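
As a rough sketch of that idea, here is a tiny reader and writer for a made-up format. The magic value `MYF1`, the 16-bit version field, and the function names are invented for the illustration; a real format would choose its own.

```python
import struct

# Hypothetical header: 4 magic bytes, then a little-endian unsigned 16-bit version.
MAGIC = b"MYF1"
HEADER = struct.Struct("<4sH")


def write_file(path, version, payload):
    """Write the header followed by the raw payload bytes."""
    with open(path, "wb") as handle:
        handle.write(HEADER.pack(MAGIC, version))
        handle.write(payload)


def read_file(path):
    """Validate the header, then return (version, payload)."""
    with open(path, "rb") as handle:
        raw = handle.read(HEADER.size)
        if len(raw) < HEADER.size:
            raise ValueError("file too short to contain a header")
        magic, version = HEADER.unpack(raw)
        if magic != MAGIC:
            raise ValueError("wrong magic number: not a file in this format")
        return version, handle.read()
```

A file that was merely renamed to the custom extension fails the magic check before any of the remaining data is interpreted, and the version field leaves room to change the layout later.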

Handle single files while extracting tar.gz

I have a huge .tgz file whose contents are structured like this:
./RandomFoldername1/file1
./RandomFoldername1/file2
./RandomFoldername2/file1
./RandomFoldername2/file2
etc.
What I want is to have each individual file extracted to standard output so that I can pipe it to another command. While doing this, I also need the RandomFoldername name and the file name, so that I can handle them properly from within the second command.
So far, the options I have are:
Extract the whole tarball and deal with the resulting files. This is not an option, since the extracted tar doesn't fit on the hard drive.
Write a loop that pattern-matches each file and extracts one file at a time. Although this solves the problem, it is too slow, because the whole tarball is scanned each time for a single file.
While searching for a solution, I've started to fear that there is no better alternative.

With tar, the command-line tool, I don't believe you have any other options.
A tar library for a language of your choice should allow you to do what you want, though: it should let you iterate over the entries in the tarball one by one and extract or pipe each file as necessary.
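
For example, Python's standard tarfile module can read a .tgz as a forward-only stream, so the archive is scanned only once and nothing is written to disk. The sketch below is one possible shape, not the only one: the helper names are made up, and passing the folder and file name to the second command as two extra arguments is just an assumption about how that command wants them.

```python
import posixpath
import subprocess
import tarfile


def stream_members(archive_path):
    """Yield (folder, filename, file object) for each regular file, in one pass."""
    # "r|gz" opens the archive as a stream, so each file object must be read
    # before the loop moves on to the next member.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # "./RandomFoldername1/file1" -> ("RandomFoldername1", "file1")
            clean = posixpath.normpath(member.name)
            folder, _, filename = clean.rpartition("/")
            yield folder, filename, archive.extractfile(member)


def pipe_to_command(archive_path, command):
    """Run command once per file, feeding the file contents on its stdin."""
    for folder, filename, data in stream_members(archive_path):
        # Assumption: the command accepts folder and file name as arguments.
        # data.read() loads one member into memory at a time.
        subprocess.run(command + [folder, filename], input=data.read(), check=True)
```

Something like `pipe_to_command("huge.tgz", ["my-second-command"])` would then invoke the second command once per archived file. If individual members are themselves too large to hold in memory, the `input=data.read()` call would need to be replaced by copying in chunks to the process's stdin.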

Process 100K image files with bash

Here is the script I use to optimize jpg images: https://github.com/kormoc/imgopt/blob/master/imgopt
There is a CMS with image files (not mine).
I assume there is a complicated structure of subdirectories, and the script just recursively finds all image files in the given folder.
The question is how to mark already processed files, so that on the next run the script won't touch them and will just skip them.
I don't know when the guys will want to add new files and process them. I also think renaming is not a good choice.
I was thinking about a hash table or associative array that would be filled from a txt file at startup. But is it OK to have an array of 100K items in bash? It seems complicated for a script.
Any other ideas about optimization are also welcome.

I think the easiest thing to do is to write out a marker file with a similar name for each processed image file.
For example, after image1.jpg has been processed, there would be an empty file with a similar name next to it, e.g. .image1.jpg.processed.
Then, when your script runs, it just checks for the current image NAME.EXT whether a file .NAME.EXT.processed exists. If that file doesn't exist, you know the image needs to be processed. There are no memory issues and no hash table is needed, although you will end up with 100K extra empty files.
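
The same bookkeeping is a `[ -e ... ]` test in bash. To keep the logic easy to see, here it is sketched in Python, with hypothetical helper names and the assumption that only *.jpg files are of interest:

```python
from pathlib import Path


def marker_for(image):
    """Return the marker path for an image: image1.jpg -> .image1.jpg.processed."""
    return image.with_name(f".{image.name}.processed")


def unprocessed_images(root, pattern="*.jpg"):
    """Yield images below root that have no marker file next to them."""
    for image in sorted(Path(root).rglob(pattern)):
        if image.is_file() and not marker_for(image).exists():
            yield image


def mark_processed(image):
    """Create the empty marker after the image has been optimized."""
    marker_for(image).touch()
```

The run loop would call the optimizer for each path from `unprocessed_images(root)` and then `mark_processed(path)`. Files added to the CMS later have no marker yet, so they are picked up on the next run.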

Bash - Identify files not referenced by other files

I have a website that runs off an OpenWRT router. I'd like to optimize the site by removing any files that aren't being used. Here is my directory structure...
/www/images
/www/js
/www/styles
/www/otherSubDirectories <--- not really named that
I'm mostly concerned about identifying images that are not used because those take the most space. But it would also be nice to identify style sheets and javascript files that are not being used. So, is there a way I can search /www and all sub directories and files and print a list of files in /www/images, /www/js, and /www/styles that are not referenced by any other files?
When I'm looking for files that contain a specific string I use this:
find . | xargs grep -Hn 'myImage.jpg'
That would tell me all files that reference the image. Maybe some variation of that?
Any help would be appreciated!
EV

Swiss File Knife is a very nice tool.
Find out which files are used (referenced) by other files through fuzzy content analysis

Consider using a cross-reference program (for example, lxr) for this problem. (I haven't verified if lxr can do the job, but believe it can.) If an off-the-shelf cross-reference program doesn't work, look for an open source cross-reference program in a language you know, and adapt it.
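
If neither tool is available, a crude version of the grep idea from the question can be written by hand: read every file once and cross off each asset whose file name appears somewhere in another file. The sketch below is in Python rather than shell and all names in it are made up. It matches on the bare file name only, so it will miss names that are assembled at runtime and can be fooled by names that are substrings of other names.

```python
from pathlib import Path

# Directory names taken from the question; adjust as needed.
ASSET_DIRS = ("images", "js", "styles")


def unreferenced_assets(root, asset_dirs=ASSET_DIRS):
    """Return asset files whose name appears in no other file under root."""
    root = Path(root)
    all_files = [p for p in root.rglob("*") if p.is_file()]
    candidates = {
        p for p in all_files if p.relative_to(root).parts[0] in asset_dirs
    }
    for source in all_files:
        try:
            content = source.read_bytes()
        except OSError:
            continue  # unreadable file: ignore it as a source of references
        # Keep only assets that this file does not mention by name.
        candidates = {
            asset
            for asset in candidates
            if asset == source or asset.name.encode() not in content
        }
    return sorted(candidates)
```

`unreferenced_assets("/www")` would return the candidates for deletion. Given the matching limits above, the list is worth reviewing by hand before anything is removed.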
