Finding duplicate image files

I have around 1 TB of images stored on my hard disk. These are pictures of friends and family taken over time. Many of them are duplicates, in the sense that the same file is saved in different locations, probably under a different name too. Is there any tool, utility, or approach (I can code one) to find the duplicate files?

I would recommend using md5deep or sha1deep. On Linux, simply install the md5deep package (it is included in most Linux distributions).
Once it is installed, run it in recursive mode over your whole disk and save the checksum of every file into a text file, using a command like this:
md5deep -r -l . > filelist.txt
If you prefer SHA-1 to MD5, use sha1deep instead (it is part of the same package).
Once you have the file, sort it using sort (or pipe the output into sort in the previous step):
sort < filelist.txt > filelist_sorted.txt
Now look at the result in any text editor. Each line starts with the checksum, so identical files end up on adjacent lines and you will quickly see all the duplicates along with their locations on disk.
If you are so inclined, you can write a simple script in Perl or Python to remove duplicates based on this file list.
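As a rough illustration of the script mentioned above, here is a minimal Python sketch that groups the paths in the checksum list by digest. It assumes each line has the form "checksum, whitespace, path"; the function name group_duplicates is made up for this example. It only reports duplicates and deletes nothing.

```python
from collections import defaultdict


def group_duplicates(lines):
    """Group paths by checksum and keep only checksums seen more than once."""
    by_hash = defaultdict(list)
    for line in lines:
        # Assumed line format: "<hex digest><whitespace><path>".
        # Splitting only once keeps paths that contain spaces intact.
        parts = line.rstrip("\r\n").split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, path = parts
        by_hash[digest].append(path)
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}
```

Passing an open file works, for example group_duplicates(open("filelist.txt")), since a file object iterates line by line. Because the grouping uses a dictionary, the list does not even need to be sorted first.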

Related

Merge two text files

I'm using Windows and Notepad++ to separate a file into txt files. I have two files that I need to merge side by side, line by line, for my data analysis.
Here is the example:
file1.txt
Abcdefghijk
abcdefghijk
file2.txt
123456
123456
then the output I want is like this:
Abcdefghijk123456
abcdefghijk123456
in a new output file. Does anybody here know how to do this?
Your question was answered here by TheMadTechnician. Using PowerShell, read both source files (1 and 2) as arrays of lines. Then use a simple loop: merge line x from file1 with line x from file2 for as long as file1 still has lines.
Unfortunately, it's impossible with pure cmd.
@riki: you could also write a batch program to do this programmatically. There should probably be no limit on the number of lines.
It may depend on the number of lines you have in each file. If there are fewer than 50 lines, I suggest simply copying and pasting.
Otherwise,
use a more powerful language such as Python, C, or PHP, and run the merge before performing your data analysis.
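For instance, the line-by-line merge could look something like this in Python. This is only a sketch: the function names merge_lines and merge_files are invented here, and UTF-8 input is an assumption.

```python
from itertools import zip_longest


def merge_lines(left_lines, right_lines):
    """Yield each left line joined with the right line at the same position."""
    # zip_longest keeps going until the longer input ends; missing lines
    # on the shorter side are treated as empty strings.
    for left, right in zip_longest(left_lines, right_lines, fillvalue=""):
        yield left.rstrip("\r\n") + right.rstrip("\r\n")


def merge_files(left_path, right_path, out_path):
    """Write the side-by-side merge of two text files to out_path."""
    with open(left_path, encoding="utf-8") as left, \
         open(right_path, encoding="utf-8") as right, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in merge_lines(left, right):
            out.write(line + "\n")
```

Calling merge_files("file1.txt", "file2.txt", "output.txt") would produce the output shown in the question. Both files are read one line at a time, so the number of lines is not a practical limit.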
There is a free utility you can download and run on your computer, called txtcollector. I read about it here. I used it because I had a whole folder of files to concatenate. It was a breeze. The only slight imperfection I noticed was that I couldn't paste in the path to the specific folder in the first step (choosing the folder where the files to be concatenated were). However, I could do this when choosing where to save the result.

How to extract specific lines from a huge data file?

I have a very large data file, about 32 GB. The file is made up of about 130k lines, each of which mainly contains numbers but also a few characters.
The task I need to perform is very clear: I have to extract 20 lines and write them to a new text file.
I know the exact line number for each of the 20 lines that I want to copy.
So the question is: how can I extract the content at a specific line number from the large file? I am on Windows. Is there a tool that can do this kind of operation, or do I need to write some code?
If there is no direct way of doing that, one possible approach is to first extract small blocks of the original file (so that each block contains one or more of the lines to extract) and then use a standard editor to find the lines within each block. In that case, the question would be: how can I split a large file into blocks by line on Windows? I use a tool named HJ-Split, which works very well with large files, but it can only split by size, not by line.
Install[1] Babun Shell (or Cygwin, but I recommend Babun), and then use the sed command as described here: How can I extract a predetermined range of lines from a text file on Unix?
[1] Installing Babun actually just means unzipping it somewhere, so you don't need Administrator rights on the server.
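If writing a little code is acceptable, the same extraction can be sketched in Python without installing a Unix shell. The function names below are hypothetical, and line numbers are assumed to be 1-based. The file is streamed line by line, so it never has to fit in memory as a whole.

```python
def extract_lines(lines, wanted):
    """Yield (line_number, line) for the 1-based line numbers in `wanted`."""
    wanted = set(wanted)
    if not wanted:
        return
    last = max(wanted)
    for number, line in enumerate(lines, start=1):
        if number in wanted:
            yield number, line
        if number >= last:
            break  # nothing left to find, so skip the rest of the file


def extract_to_file(src_path, dst_path, wanted):
    """Copy the wanted lines from src_path to dst_path, in file order."""
    # Binary mode avoids decoding the whole file and keeps line endings as-is.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for _, line in extract_lines(src, wanted):
            dst.write(line)
```

Reading stops as soon as the highest requested line has been written, so the cost depends on where the last wanted line sits in the file.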

Handle single files while extracting tar.gz

I have a huge .tgz file whose contents are structured like this:
./RandomFoldername1/file1
./RandomFoldername1/file2
./RandomFoldername2/file1
./RandomFoldername2/file2
etc
What I want to do is extract each individual file to standard output so that I can pipe it to another command afterwards. While doing this, I also need the RandomFoldername name and the file name so that I can handle them properly in the second command.
So far, the options I have are:
to extract the whole tarball and deal with the extracted files, which is not an option since the extracted tar doesn't fit on the hard drive;
to write a loop that pattern-matches each file and extracts one file at a time. Although this solves the problem, it is too slow because the whole tarball is swept each time for only one file.
While searching for a solution, I've started to fear that there is no better alternative.
Using the tar tool itself, I don't believe you have any other options.
Using a tar library for a language of your choice should let you do what you want, though, since it lets you iterate over the entries in the tarball and extract or pipe each file one by one as necessary.
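As one possible illustration of the library approach, Python's standard tarfile module can read a gzip-compressed tarball as a stream in a single pass. The helper name iter_tar_files is invented for this sketch, and it assumes the folder/file layout shown in the question.

```python
import posixpath
import tarfile


def iter_tar_files(fileobj):
    """Yield (folder, filename, reader) for each regular file, in archive order."""
    # Mode "r|gz" reads the gzip stream sequentially, without seeking backwards.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            # normpath drops a leading "./" so the folder name comes out clean.
            folder, filename = posixpath.split(posixpath.normpath(member.name))
            # In stream mode the reader must be consumed before moving on
            # to the next member.
            yield folder, filename, archive.extractfile(member)
```

Each reader is a file-like object, so its contents can be copied to the standard input of another process (for example with shutil.copyfileobj) together with the folder and file name, one entry at a time and without unpacking anything to disk.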

Split and rejoin wav files

I'm trying to edit around 200 wav files with a Windows program that doesn't support command-line batch processing. So it seems like the easiest way to do it would be to combine the wavs into one file (they're all short), and then split them back into the original files after editing.
Sox will give me the length, and I already have the names of course. Is there any way to, say, combine all the wavs in a directory into a single wav file, while recording the names, lengths, and merge order in a txt file, and then use the txt to turn the result back into wavs with the original names and lengths?
Edit: I seem to be doing something wrong. I ran this script first:
#!/bin/bash
for f in *.wav
do
dd if=$f of=new_$f bs=1 skip=44
done
Then I moved all of the original files out of the folder, deleted the first of the new files, and copied the first of the originals back in. Then I did this:
cat *.wav > merged.wav
This gives me one file that's as big as it should be, but when I open it with a media player, it just plays the portion that was the first file, and then stops before playing the others.
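I don't know how creative you want to get. A wav is just a header followed by binary data. So long as they're all the same format (sample size and everything), you can use cut or split to strip the 44 bytes off the beginning of all of them, keeping one copy of the header at the beginning, cat them into one file, do what you want to them, and then split the result back up using another script with the same list of filenames. Note that the header also records the length of the audio data, so the header you keep has to be updated to the merged length; otherwise a player may stop after the first file's worth of audio.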
Sox can do this.
Assuming all wav files are in c:\temp the command is
sox c:\temp\*.wav c:\temp\merged.wav
(The example is for Windows; on Linux use Linux path notation.)
To preserve the lengths and names, I'd use sox to get the length of each file and then create a cue file from that info.
This cue file can later be used to split the audio.
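As an alternative sketch of the same idea (this swaps in Python's standard wave module rather than sox and a cue file), the merge can return a list of names and frame counts that is later used to cut the file apart again. The function names are made up, and the sketch assumes uncompressed PCM files that all share the same channel count, sample width, and sample rate.

```python
import wave


def join_wavs(paths, merged_path):
    """Concatenate WAV files; return [(path, frame_count), ...] in merge order."""
    manifest = []
    fmt = None
    with wave.open(merged_path, "wb") as out:
        for path in paths:
            with wave.open(path, "rb") as src:
                params = src.getparams()
                if fmt is None:
                    # Channels, sample width and rate come from the first file.
                    fmt = params[:3]
                    out.setparams(params)
                elif params[:3] != fmt:
                    raise ValueError("format mismatch in " + path)
                frames = src.getnframes()
                out.writeframes(src.readframes(frames))
                manifest.append((path, frames))
    return manifest


def split_wav(merged_path, manifest):
    """Cut the merged file back into pieces named and sized per the manifest."""
    with wave.open(merged_path, "rb") as src:
        for path, frames in manifest:
            with wave.open(path, "wb") as out:
                out.setparams(src.getparams())
                out.writeframes(src.readframes(frames))
```

The wave module writes the correct data length into each header when a file is closed, so the merged file plays to the end. The manifest could be saved to a txt file, one name and frame count per line. Splitting only gives back the original boundaries if the edit does not change the total length or the sample rate.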

Diff for 3 binary files

I have 3 binary files. Let's call them file1.bin, file2.bin and file3.bin.
file1.bin and file2.bin have some common parts.
file2.bin and file3.bin have some common parts.
I want to find the common parts between file1.bin and file2.bin that are different between file2.bin and file3.bin.
How would you recommend accomplishing that? I have already dumped the binary files to text files using xxd and then did a 3-way diff using vim -d file1.txt file2.txt file3.txt.
However, vim marks a part as changed in all the files even if it has changed in only one file and remains the same in the other two. I want those occurrences to be marked differently.
Perhaps you can use the built-in Unix diff (I think it is part of OS X), with the --unchanged-group-format option to list the similarities. Do that for file1 and file2, then do it for file2 and file3. You can then run a regular diff on the two resulting files.
For an idea of how to get the similarities, have a look at this post.
The tool that I work on (ECMerge) does that. You just have to diff the 3 binary files; it will present equal portions next to each other, with modified bytes placed appropriately in between. There is no need to produce a hex dump first. You can script it in JavaScript to output whatever you like based on the diff results and the bytes in the files (it also works from the command line).
Chromium used bsdiff, then switched to Courgette for binary diffs, as explained on their blog here. You might find useful leads there.
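For small files, the "similarities of file1/file2 minus similarities of file2/file3" idea can also be sketched with Python's standard difflib. This is not what any of the tools above do internally; the function names are invented, and "different between file2 and file3" is interpreted here as "not part of any matching block between file2 and file3". SequenceMatcher can be very slow on large inputs, so treat this as a starting point only.

```python
from difflib import SequenceMatcher


def matching_ranges(a, b, side):
    """Return sorted [start, end) ranges of matching blocks, in `side` coordinates."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    ranges = []
    for block in matcher.get_matching_blocks():
        if block.size:  # the final block is a zero-length sentinel
            start = block.a if side == "a" else block.b
            ranges.append((start, start + block.size))
    return sorted(ranges)


def subtract_ranges(keep, remove):
    """Return the parts of the `keep` ranges not covered by any `remove` range."""
    result = []
    for start, end in keep:
        cursor = start
        for r_start, r_end in remove:  # `remove` must be sorted, non-overlapping
            if r_end <= cursor or r_start >= end:
                continue
            if r_start > cursor:
                result.append((cursor, r_start))
            cursor = max(cursor, r_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def shared_with_1_but_not_3(file1, file2, file3):
    """Byte ranges of file2 that match file1 but have no match in file3."""
    common_12 = matching_ranges(file1, file2, side="b")
    common_23 = matching_ranges(file2, file3, side="a")
    return subtract_ranges(common_12, common_23)
```

The three arguments are bytes objects, for example the result of open("file1.bin", "rb").read(). The returned offsets refer to positions in file2.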
