Finding and Converting HTML Files and Moving Them En-Masse - wolfram-mathematica

I'm using Mathematica to work with a large array of website files, which I've mirrored onto my own system. They are spread across several hundred directories, with tons of sub-directories. So for example, I have:
/users/me/test/directory1
/users/me/test/directory1/subdirectory2 [times a hundred]
/users/me/test/directory2
/users/me/test/directory2/subdirectory5 [etc. etc.]
What I need to do is to go into each directory, Import[] all the HTML files as Plaintext, and then put them in another directory elsewhere on my system named after 'directory1'. So far, with Do[] loops I have been able to do a rough version: the best case I have right now, however, is dumping the ".txt" files in the original directory, which isn't an ideal solution as they're still spread all over my system.
To find my files, I use directoryfiles = FileNames["*.htm*", {"*"}, Infinity];
Some additional vexing problems:
(1) Duplicates: Is there a way for Mathematica to deal with duplicates - i.e. if we run into another index_en.html can it be renamed as index_en_1.html?
(2) Directories: Because of all the directories, it keeps running into trouble unless I have Mathematica call SetDirectory and CreateDirectory over and over again.
This all seems a bit confusing. Basically, is there an efficient way for Mathematica to find a ton of HTML files spread across hundreds of directories/subdirectories, Import them as plaintext, and export them somewhere else? [It's important for me to know they came from directory1, but that's it.]
-- edited for clarity below --
Here is the code that I currently have:
SetDirectory["/users/me/web/"];
dirlist = FileNames[];
directoryPrefix = "/users/me/web/";
plainHTMLBucket = {};
Do[
  directory = directoryPrefix <> dirname;
  exportPrefix = "/users/me/desktop/bucket/";
  SetDirectory[directory];
  allFiles = FileNames["*.htm*", {"*"}, Infinity];
  plainHTMLBucket = {};
  Do[
    plainHTML = Import[filename, "Plaintext"];
    AppendTo[plainHTMLBucket, plainHTML];
    , {filename, allFiles}];
  Export[exportPrefix <> dirname <> ".txt", plainHTMLBucket];
  Print["We Have Reached Here"];
  , {dirname, dirlist}];
What's wrong with it from my perspective? Besides being messy, it's a workaround: I would much rather have the files kept separate rather than concatenated into one big one - i.e. export each import as its own file, but inside a directory called 'directory1', albeit somewhere else. The problem comes when mirroring these directories: they don't exist yet under the target, and I am having trouble using CreateDirectory[] to create them dynamically.
My apologies for the confusion here - I know it shows with this question.

The following code might do the trick:
mapFileNames[source_, filenames_, target_] :=
Module[{depth = FileNameDepth[source]}
, FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
]
htmlTreeToPlainText[source_, target_] :=
Module[{htmlFiles, textFiles, targetDirs}
, htmlFiles = FileNames["*.html", source, Infinity]
; textFiles = StringReplace[
mapFileNames[source, htmlFiles, target]
, f__~~".html"~~EndOfString :> f~~".txt"
]
; targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]
; If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
; Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
; Scan[
Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
, Transpose[{htmlFiles, textFiles}]
]
]
Example use (warning: the target directory will be deleted first!):
htmlTreeToPlainText["/users/me/web", "/users/me/desktop/bucket"]
How It Works
The various Mathematica FileName... functions are helpful in this context. First, we start by defining the helper function mapFileNames that takes a source directory, a list of file names that lie within the source directory, and a target directory. It returns a list of file paths that name the corresponding locations underneath the target directory.
mapFileNames[source_, filenames_, target_] :=
Module[{depth = FileNameDepth[source]}
, FileNameJoin[{target, FileNameDrop[#, depth]}]& /@ filenames
]
The function uses FileNameDrop to drop the leading source path elements from each filename and FileNameJoin to prepend the target path onto the front of each result. The number of leading elements to drop is determined by applying FileNameDepth to the source path.
For example:
In[83]:= mapFileNames["/a/b", {"/a/b/x.txt", "/a/b/c/y.txt"}, "/d"]
Out[83]= {"/d/x.txt", "/d/c/y.txt"}
Using this function, we can convert a list of HTML file paths under a source directory (source) into the corresponding list of text file paths under the target directory (target):
htmlFiles = FileNames["*.html", source, Infinity]
textFiles = StringReplace[
mapFileNames[source, htmlFiles, target]
, f__~~".html"~~EndOfString :> f~~".txt"
]
These statements retrieve the list of HTML files, map them to the target directory, and then change the file extension from .html to .txt. We can now extract the necessary directory names from the resulting text files:
targetDirs = DeleteDuplicates[FileNameDrop[#, -1]& /@ textFiles]
Again FileNameDrop is used, this time to drop the filename portion from each text file's path.
Next, we need to delete the target directory (if it already exists) and create the new required directories:
If[FileExistsQ[target], DeleteDirectory[target, DeleteContents -> True]]
Scan[CreateDirectory[#, CreateIntermediateDirectories -> True]&, targetDirs]
We can now perform the HTML-to-text transformation, safe in the knowledge that the target directories already exist:
Scan[
Export[#[[2]], Import[#[[1]], "Plaintext"], "Text"] &
, Transpose[{htmlFiles, textFiles}]
]

To set the current directory, do something like
SetDirectory["~/Desktop/"]
Now, suppose I wish to obtain a list of all directories in the current directory. I can do
dirs = Pick[
  #,
  (FileType[#] == Directory) & /@ #
] &@FileNames[]
which returns a list of the names of all the directories under the current directory that you've set earlier (I use nested pure functions, which may be confusing...). You can then apply fn to each of the dirs with Scan[fn, dirs]. So, you could assign the Pick[] construct to a function, then use it to recurse down your tree.
This is straightforward but I am not sure it's what you want. Maybe you could be a little more explicit on what you're after so I/we do not sit down and code the wrong thing.

Related

How to find duplicate directories

Let's create a testing directory tree:
#!/bin/bash
top="./testdir"
[[ -e "$top" ]] && { echo "$top already exists!" >&2; exit 1; }
mkfile() { printf "%s\n" $(basename "$1") > "$1"; }
mkdir -p "$top"/d1/d1{1,2}
mkdir -p "$top"/d2/d1some/d12copy
mkfile "$top/d1/d12/a"
mkfile "$top/d1/d12/b"
mkfile "$top/d2/d1some/d12copy/a"
mkfile "$top/d2/d1some/d12copy/b"
mkfile "$top/d2/x"
mkfile "$top/z"
The resulting structure, as listed by find testdir \( -type d -printf "%p/\n" , -type f -print \), is:
testdir/
testdir/d1/
testdir/d1/d11/
testdir/d1/d12/
testdir/d1/d12/a
testdir/d1/d12/b
testdir/d2/
testdir/d2/d1some/
testdir/d2/d1some/d12copy/
testdir/d2/d1some/d12copy/a
testdir/d2/d1some/d12copy/b
testdir/d2/x
testdir/z
I need to find the duplicate directories, but I need to consider only files (e.g. I should ignore (sub)directories without files). So, from the above test tree the wanted result is:
duplicate directories:
testdir/d1
testdir/d2/d1some
because both (sub)trees contain only the two identical files a and b (and several directories without files).
Of course, I could run md5deep -Zr ., or walk the whole tree with a perl script (using File::Find + Digest::MD5, or Path::Tiny or the like) and calculate the files' md5 digests, but that by itself doesn't help with finding the duplicate directories... :(
Any idea how to do this? Honestly, I don't have any idea.
EDIT
I don't need working code. (I'm able to code it myself.)
I "just" need some ideas on "how to approach" the solution of the problem. :)
Edit2
The rationale behind this - why I need it: I have approx 2.5 TB of data copied from many external HDDs as a result of a wrong backup strategy. E.g. over the years, whole $HOME dirs were copied onto (many different) external HDDs. Many sub-directories have the same content, but they're in different paths. So, now I'm trying to eliminate the same-content directories.
And I need to do this by directories, because there are directories which have some duplicate files, but not all. Let's say:
/some/path/project1/a
/some/path/project1/b
and
/some/path/project2/a
/some/path/project2/x
e.g. the a is a duplicate file (not only by name, but by content too) - but it is needed for both projects. So I want to keep the a in both directories - even though they're duplicate files. Therefore I'm looking for a "logic" for how to find duplicate directories.
Some key points:
If I understand right (from your comment, where you said: "Also, when I say identical files I mean identical by their content, not by their name"), you want to find duplicate directories, e.g. directories whose content is exactly the same as in some other directory, regardless of the file names.
For this you must calculate some checksum or digest for the files. Identical digest = identical file (with great probability). :) As you already said, md5deep -Zr -of /top/dir is a good starting point.
I added the -of because for such a job you don't want to calculate the content of symlink targets, or of other special files like fifos - just plain files.
Calculating the md5 for each file in a 2.5 TB tree will surely take a few hours, unless you have a very fast machine. md5deep runs a thread for each CPU core. So, while it runs, you can work on the scripts.
Also, consider running md5deep as sudo, because it could be frustrating if after a long run-time you get error messages about unreadable files, only because you forgot to change the file ownerships... (Just a note.) :) :)
For the "how to":
For comparing "directories" you need calculate some "directory-digest", for easy compare and finding duplicates.
The one most important thing is realize the following key points:
you could exclude directories, where are files with unique digests. If the file is unique, e.g. has not any duplicates, that's mean that is pointless checking it's directory. Unique file in some directory means, that the directory is unique too. So, the script should ignore every directory where are files with unique MD5 digests (from the md5deep's output.)
You don't need calculate the "directory-digest" from the files itself. (as you trying it in your followup question). It is enough to calculate the "directory digest" using the already calculated md5 for the files, just must ensure that you sort them first!
For example, if your directory /path/to/some contains only two files a and b, and
if file "a" has md5 : 0cc175b9c0f1b6a831c399e269772661
and file "b" has md5: 92eb5ffee6ae2fec3ad71c777531578f
you can calculate the "directory-digest" from the above file digests, e.g. using Digest::MD5 you could do:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(sort qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
and you will get 3bc22fb7aaebe9c8c5d7de312b876bb8 as your "directory-digest". The sort is crucial(!) here, because the same command, but without the sort:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
produces: 3a13f2408f269db87ef0110a90e168ae.
Note that even though the above digests aren't the digests of your files, they will be unique for every directory with different files and will be the same for identical files (because identical files have identical md5 file digests). The sorting ensures that you always calculate the digest in the same order, e.g. if some other directory contains two files
file "aaa" has md5 : 92eb5ffee6ae2fec3ad71c777531578f
file "bbb" has md5 : 0cc175b9c0f1b6a831c399e269772661
using the above sort and md5 you will again get 3bc22fb7aaebe9c8c5d7de312b876bb8 - e.g. this directory contains the same files as above...
So, in this way you can calculate a "directory-digest" for every directory you have, and you can be sure that if you get another directory digest of 3bc22fb7aaebe9c8c5d7de312b876bb8, it means: this directory contains exactly the above two files a and b (even if their names are different).
This method is fast, because you calculate the "directory-digests" only from small 32-byte strings, so you avoid excessive repeated file-digest calculations.
The final part is easy now. Your final data should be in the form:
3a13f2408f269db87ef0110a90e168ae /some/directory
16ea2389b5e62bc66b873e27072b0d20 /another/directory
3a13f2408f269db87ef0110a90e168ae /path/to/other/directory
so, from this it is easy to see that
/some/directory and /path/to/other/directory are identical, because they have identical "directory-digests".
Hm... all of the above is only a few lines of perl script. It would probably have been faster to write the perl script here directly than this long textual answer - but you said you don't want code... :) :)
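Just to make the idea concrete anyway, here is a minimal sketch in Python rather than perl. It is not a finished script: file_md5 and duplicate_directories are made-up names, and it compares only the plain files directly inside each directory, not whole subtrees.

import os
import hashlib
from collections import defaultdict

def file_md5(path):
    # md5 of one file's content, read in chunks so large files are fine
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicate_directories(top):
    # "directory-digest" = md5 of the sorted file digests of a directory;
    # directories without plain files are ignored, as in the question
    by_digest = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(top):
        digests = []
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):  # plain files only
                digests.append(file_md5(full))
        if digests:
            digests.sort()
            dir_digest = hashlib.md5("".join(digests).encode()).hexdigest()
            by_digest[dir_digest].append(dirpath)
    return [paths for paths in by_digest.values() if len(paths) > 1]

for group in duplicate_directories("./testdir"):
    print("duplicate directories:")
    for path in group:
        print(" ", path)

On the sample testdir tree this reports testdir/d1/d12 and testdir/d2/d1some/d12copy as duplicates; folding that result up to the parent directories (d1 vs d1some) is the part you would still have to add.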
A traversal can identify directories which are duplicates in the sense you describe. I take it that this means: if all files in a directory are equal to all files of another directory, then their paths are duplicates.
Find all files in each directory and form a string with their names. You can concatenate the names with a comma, say (or some other sequence that is certainly not in any name). This string is what gets compared. Prepend the path to this string, so as to identify directories.
Comparison can be done, for instance, by populating a hash with the filename strings as keys and the paths as their values. Once you find that a key already exists, you can check the contents of the files and add the path to the list of duplicates.
The strings with path don't have to be actually formed, as you can build the hash and dupes list during the traversal. Having the full list first allows for other kinds of accounting, if desired.
This is altogether very little code to write.
An example. Let's say that you have
dir1/subdir1/{a,b} # duplicates (files 'a' and 'b' are considered equal)
dir2/subdir2/{a,b}
and
proj1/subproj1/{a,b,X} # NOT duplicates, since there are different files
proj2/subproj2/{a,b,Y}
The above prescription would give you strings
'dir1/subdir1/a,b',
'dir2/subdir2/a,b',
'proj1/subproj1/a,b,X',
'proj2/subproj2/a,b,Y';
where the (sub)string 'a,b' identifies dir1/subdir1 and dir2/subdir2 as duplicates.
I don't see how you can avoid a traversal to build a system that accounts for all files.
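As a rough sketch of that first pass (in Python for brevity; candidate_duplicates is a made-up name, and it leans on the assumption, discussed below, that equal file names very likely indicate equal files):

import os
from collections import defaultdict

def candidate_duplicates(top):
    # Key each directory by the comma-joined, sorted list of file names
    # directly inside it; any key shared by more than one path is a
    # candidate group of duplicate directories.
    seen = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(top):
        if filenames:
            seen[",".join(sorted(filenames))].append(dirpath)
    return {names: paths for names, paths in seen.items() if len(paths) > 1}

for names, paths in candidate_duplicates("./testdir").items():
    print(names, "->", ", ".join(paths))

A content check (sizes or digests) on each candidate group would then confirm or reject the duplicates, as described above.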
The procedure above is the first step; it does not handle directories that contain both files and subdirectories.
Consider
dirA/a
dirA/b
dirA/sdA/c
dirA/sdA/d

dirB/a
dirB/X
dirB/sdB/c
dirB/sdB/d
Here the paths dirA/sdA/ and dirB/sdB/ are duplicates by the problem description but the whole dirA/ and dirB/ are distinct. This isn't shown in the question but I'd expect it to be of interest.
The procedure from the first part can be modified for this. Iterate through directories, forming a path component at every step. Get all files in each, and all subdirectories (if none we are done). Append the comma-separated file list to the path component (/sdA/). So the representation of the above is
'dirA/sdA,a,b/c,d', 'dirB/sdB,a,X/c,d'
For each file-list substring (c,d) found to already exist, we can check its path against the existing one, component by component. A plain hash with keys like c,d won't do, since this example has the same file-list for distinct hierarchies, so a modified (or different) data structure is needed.
Finally, there may be more subdirectories parallel to sdA (say sdA2). We care only for its own path, apart from the parallel files (a,b in that component of the path, dirA/sdA2,a,b/). So keep all bottom-level file-lists (c,d) with their paths and, if the file-lists are equal and the paths are of the same length, check whether their paths have equal a,b file-lists in each path component.
I don't know whether this is a workable solution for you, but I'd expect "near-duplicates" to be rare -- a backup is either a duplicate or it isn't. So there may not be much need to handle further edge cases in complex sprawling hierarchies. This procedure should at least be a useful pre-selection mechanism that would greatly reduce the need for further work.
This assumes that equal file-names very likely indicate equal files. A part of that is my expectation that if a file was even just renamed it still cannot be considered a duplicate. If this is not so this approach won't work and one would need something along the lines of the answer by jm666.
I made a tool which searches for duplicate folders.
https://github.com/un1t/dirdups
dirdups testdir -i 1
The -i 1 option considers folders as duplicates if they have at least 1 file in common. Without this option the default value is 10.
In your case it will find the following directories:
testdir/d1/d12/
testdir/d2/d1some/d12copy/

copy files from different locations and with different names into one target directory

I really looked for solutions to this, but all the solutions just assume that either the file names are similar (like file1, file2, file3...) so a regular expression can be used, or that find can look for files with a certain name. But in my case I have different file names:
Assume I have
file1 in E:\dir1\, input99 in C:\dir2, result3 in F:\dir3
so these three files reside in different locations, on different drives, but still local.
Is it possible to use cp with absolute pathnames?
like:
cp /dir1/file1, /dir2/input99, /dir3/result3 /targetdirectory
cp accepts a list of source files (specified either as absolute paths or relative paths) and copies all of them to the target directory (the last parameter). So:
cp E:\dir1\file1 C:\dir2\input99 F:\dir3\result3 C:\targetdirectory

Load all data files in a directory in Octave

I need to load all data in a directory into octave (no matter what their filenames are), so that the data from separate files are loaded into separate matrices. How can I do that?
I've tried to use dir and glob and then use a for loop but I don't know how to get matrices from cells.
I'm not 100% sure of your question. When you mention getting matrices from cells, I'm guessing your problem is extracting the filenames from the output of readdir and glob. If that's so, you can get the names with filelist{1} (if you use () to index a cell array you get another cell array).
filelist = readdir (pwd)
for ii = 1:numel(filelist)
## skip special files . and ..
if (regexp (filelist{ii}, "^\\.\\.?$"))
continue;
endif
## load your file
load filelist{ii}
## do your maths
endfor
You can use a struct on the load line if the filenames are good variable names: data.(filelist{ii}) = load (filelist{ii}).
The answer by carandraug is great; I only want to point out that in some Octave versions, the load line may need to be written as:
load (filelist{ii})

How to read images from folders in matlab

I have six folders (shown in a screenshot that is not reproduced here), and each folder contains some images. I know how to read images in MATLAB, BUT my question is how I can traverse through these folders and read the images from an abc.m file (also shown in the screenshot).
So basically you want to read images in different folders without putting all of the images into one folder and using imread()? Because you could just copy all of the images (and name them in a way that lets you know which folder they came from) into your MATLAB working directory and then load them that way.
Use the cd command to change directories (like in *nix) and then load/read the images as you traverse through each folder. You might need absolute path names.
The easiest way is certainly to right-click the folder in MATLAB and choose "Add to Path" >> "Selected Folders and Subfolders".
Then you can just get images with imread without specifying the path.
If you know the path to the image-containing directory, you can use dir on it to list all the files (and directories) in it. Filter the files with the image extension you want and voila, you have an array with all the images in the directory you specified:
dirname = 'images';
ext = '.jpg';
sDir = dir( fullfile(dirname, ['*' ext]) );
sDir([sDir.isdir])=[]; % remove directories
% following is obsolete because wildcarded dir ^^
b=arrayfun(@(x) strcmpi(x.name(end-length(ext)+1:end),ext),sDir); % filter on extension
sFiles = sDir(b);
You probably want to prefix the name of each file with the directory before using:
sFileName{ii} = fullfile(dirname, sFiles(ii).name);
You can process the resulting files as you want. Loading all the files, for example:
for ii=1:numel(sFiles)
data{ii}=imread(fullfile(dirname, sFiles(ii).name));
end
If you also want to recurse the subdirectories, I suggest you take a look at:
How to get all files under a specific directory in MATLAB?
or other solutions on the FEX:
http://www.mathworks.com/matlabcentral/fileexchange/8682-dirr-find-files-recursively-filtering-name-date-or-bytes
http://www.mathworks.com/matlabcentral/fileexchange/15505-recursive-dir
EDIT: added Amro's suggestion of wildcarding the dir call

Ruby FTP Separating files from Folders

I'm trying to crawl FTP and pull down all the files recursively.
Up until now I was trying to pull down a directory with
ftp.list.each do |entry|
  if entry.split(/\s+/)[0][0, 1] == "d"
    out[:dirs] << entry.split.last unless black_dirs.include? entry.split.last
  else
    out[:files] << entry.split.last unless black_files.include? entry.split.last
  end
end
But it turns out that if you split each entry on whitespace and take the last field, filenames and directories containing spaces are parsed wrong.
Need a little help on the logic here.
You can avoid recursion if you list all files at once
files = ftp.nlst('**/*.*')
Directories are not included in the list but the full ftp path is still available in the name.
EDIT
I'm assuming that each file name contains a dot and directory names don't. Thanks for mentioning it, @Niklas B.
There are a huge variety of FTP servers around.
We have clients who use some obscure proprietary, Windows-based servers, and the file listings returned by them look completely different from those of Linux servers.
So what I ended up doing is: for each file/directory entry, I try changing directory into it, and if that doesn't work - I consider it a file :)
The following method is "bullet proof":
# Checks if the given file_name is actually a file.
def is_ftp_file?(ftp, file_name)
ftp.chdir(file_name)
ftp.chdir('..')
false
rescue
true
end
file_names = ftp.nlst.select {|fname| is_ftp_file?(ftp, fname)}
Works like a charm, but please note: if the FTP directory has tons of files in it - this method takes a while to traverse all of them.
You can also use a regular expression. I put one together. Please verify whether it works for you, as I don't know if your dir listing looks different. You have to use Ruby 1.9, btw.
reg = /^(?<type>.{1})(?<mode>\S+)\s+(?<number>\d+)\s+(?<owner>\S+)\s+(?<group>\S+)\s+(?<size>\d+)\s+(?<mod_time>.{12})\s+(?<path>.+)$/
match = entry.match(reg)
You are then able to access the elements by name.
match[:type] contains a 'd' if it's a directory, a space if it's a file.
All the other elements are there as well. Most importantly match[:path].
Assuming that the FTP server returns Unix-like file listings, the following code works. At least for me.
regex = /^d[r|w|x|-]+\s+[0-9]\s+\S+\s+\S+\s+\d+\s+\w+\s+\d+\s+[\d|:]+\s(.+)/
ftp.ls.each do |line|
if dir = line.match(regex)
puts dir[1]
end
end
dir[1] contains the name of the directory (given that the inspected line actually represents a directory).
As @Alex pointed out, using patterns in filenames for this is hardly reliable. Directories CAN have dots in their names (.ssh for example), and listings can be very different on different servers.
His method works but, as he himself points out, takes too long.
I prefer using the .size method from Net::FTP.
It returns the size of a file, or throws an error if the file is a directory.
def item_is_file? (item)
ftp = Net::FTP.new(host, username, password)
begin
if ftp.size(item).is_a? Numeric
true
end
rescue Net::FTPPermError
return false
end
end
I'll add my solution to the mix...
Using ftp.nlst('**/*.*') did not work for me... server doesn't seem to support that ** syntax.
The chdir trick with a rescue seems expensive and hackish.
Assuming that all files have at least one char, a single period, and then an extension, I did a simple recursion.
def list_all_files(ftp, folder)
entries = ftp.nlst(folder)
file_regex = /.+\.{1}.*/
files = entries.select{|e| e.match(file_regex)}
subfolders = entries.reject{|e| e.match(file_regex)}
subfolders.each do |subfolder|
files += list_all_files(ftp, subfolder)
end
files
end
nlst seems to return the full path of whatever it finds, non-recursively... so each time you get a listing, separate the files from the folders, and then process any folder you find recursively. Collect all the file results.
To call it, you can pass a starting folder:
files = list_all_files(ftp, "my_starting_folder/my_sub_folder")
files = list_all_files(ftp, ".")
files = list_all_files(ftp, "")
files = list_all_files(ftp, nil)
