Cannot mount a split ISO after concatenation - installation

I have a split ISO (7 files of 7GB) which I concatenate in the terminal with
cat /Volumes/Blah/*.iso > /Volumes/Blah/concatenated.iso
I can see that concatenated.iso has a size of 7 x 7GB, but when I then use any mounting software on macOS (I tried Keka and Disk Utility), the mounted disk shows a size of only 7GB and seems to contain only the first part. What am I doing wrong here?

Assuming the split parts do not have headers of their own, and can simply be concatenated:
You need to ensure they are concatenated in the correct order. The expansion of cat /Volumes/Blah/*.iso follows the shell's collation order, which may not match the intended order of the parts (for example, foo_10.iso sorts before foo_2.iso). Run echo /Volumes/Blah/*.iso to see what order you actually end up with.
So list all the files manually in the correct order, like cat /Volumes/Blah/foo_1.iso /Volumes/Blah/foo_2.iso /Volumes/Blah/foo_3.iso > /Volumes/Blah/concatenated.iso. If you have a very large number of files you can use a for loop (see the sketch below), but if this is a one-time job you are probably faster just copying, pasting, and modifying the path of each file manually.
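A minimal sketch of the loop variant, assuming the parts are named foo_1.iso through foo_7.iso (hypothetical names; adjust to your actual filenames):
for n in 1 2 3 4 5 6 7; do
    cat "/Volumes/Blah/foo_${n}.iso"
done > /Volumes/Blah/concatenated.iso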

Related

Script to append files with same part of the name

I have a bunch of files (> 1000) whose content is columns of numbers separated by spaces. I would like to reduce the number of files by appending the content of groups of them into one file.
All the filenames start with time_ followed by a NUMBER, then the rest of the filename (pow_....txt). For example: time_0.6pow_0.1-173.txt
I would like to append the files that share the same NUMBER into a single file, and do it with a script since I have ~70 different NUMBERs.
I have found
cat time_0.6pow_*.txt > time_0.6.txt
It works, but I would like to make a script that does it for all the possible NUMBERs.
Regards
You can do it like this:
for fName in time_*pow_*.txt; do
    s="${fName#time_}"                      # strip the leading "time_"
    cat "$fName" >> time_"${s%%pow*}".txt   # strip everything from "pow" on, leaving only NUMBER
done
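For example (a quick illustration, not part of the original answer), with the filename from the question:
fName=time_0.6pow_0.1-173.txt
s="${fName#time_}"       # s is now 0.6pow_0.1-173.txt
echo "${s%%pow*}"        # prints 0.6, so that file's data is appended to time_0.6.txt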

Bash script that creates files of a set size

I'm trying to set up a script that will create .txt files with a size of 24MB in the /tmp/ directory. The idea behind this script is that Zabbix, a monitoring service, will notice that the directory is full and wipe it completely using a recovery expression.
However, I'm new to Linux and seem to be stuck on the script that generates the files. This is what I've currently written out.
today="$( date +"%Y%m%d" )"
number=0
while test -e "$today$suffix.txt"; do
    (( ++number ))
    suffix="$( printf -- %02d "$number" )"
done
fname="$today$suffix.txt"
printf 'Will use "%s" as filename\n' "$fname"
printf -c 24m /tmp/testf > "$fname"
I'm thinking that what I'm doing wrong has to do with the printf command, but any input, advice, and/or pointers to a scripting guide are very welcome.
Many thanks,
Melanchole
I guess that it doesn't matter what bytes are actually in that file, as long as it fills up the temp dir. For that reason, the right tool to create the file is dd, which is available in every Linux distribution and usually installed by default.
Check the man page for the different options, but the most important ones are listed below (a combined example follows the list):
if: the input file; /dev/zero is a good choice here, as it is just an endless stream of zero-valued bytes
of: the output file; you can keep the code you already have to generate its name
count: the number of blocks to copy; just use 24 here
bs: the size of each block; use 1MB for that
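A combined sketch (the $fname variable comes from the question's script, and /tmp is assumed to be the target directory):
dd if=/dev/zero of="/tmp/$fname" bs=1MB count=24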

Text file search tool for Windows (command line) with an extremely large pattern list

Is there an efficient way to search for a list of strings, taken from another text file, in a file or in piped output?
I have tried the following methods:
FINDSTR /G:patternlist.txt <filetocheck>
or
Some program whose output is piped to FINDSTR
SOMEPROGRAM | FINDSTR /G:patternlist.txt
Similarly, I tried GREP from MSYS, UnixUtils, the GNU package, etc.:
GREP -w -F -f patternlist.txt <filetocheck>
or
Some program whose output is piped to GREP
SOMEPROGRAM | GREP -w -F -f patternlist.txt
The pattern list file is a text file which contains one literal string per line. For example:
Patternlist.txt
65sM547P
Bu83842T
t400N897
s20I9U10
i1786H6S
Qj04404e
tTV89965
etc.,
And the file to be checked contains similar strings, but in some cases there may be multiple words in a single line. For example:
Filetocheck.txt
3Lo76SZ4 CjBS8WeS
iI9NvIDC
TIS0jFUI w6SbUuJq joN2TOVZ
Ee3z83an rpAb8rWp
Rmd6vBcg
O2UYJOos
hKjL91CB
Dq0tpL5R R04hKmeI W9Gs34AU
etc.,
They work as expected if the number of pattern literals is less than 50,000, and they sometimes work, very slowly, with up to 100,000 patterns.
Also, filetocheck.txt will contain up to 250,000 lines and grows up to 30 MB in size.
The problem comes when the pattern file becomes larger than that. I have an instance of the pattern file which is around 20 MB and contains 600,000 string literals.
Matching this against a list or output of 250,000 to 300,000 lines of text practically stalls the processor.
I tried SIFT and multiple other text search tools, but they just kill the system with their memory requirements and processor usage and make the system unresponsive.
I require a command-line based solution or utility which could help in achieving this task, because this is a part of another big script.
I have tried multiple approaches to speed this up, such as indexing the pattern file and sorting it alphabetically, but all in vain.
Since the input will be from a program, there is no option to split the input file either. It is all in one big piped command.
Example:
PASSWORDGEN | <COMMAND_TO_FILTER_KNOWN_PASSWORDS> >> FILTERED_OUTPUT
It is in this part that the system hangs or takes a very long time, whether it is filtering the stdout stream or a saved results file.
System configuration details, if this is of any help:
I am running this on a modest machine: 8 GB RAM, SATA HDD, Core i7, Windows 7 64-bit, and I do not have any better configuration available at the moment.
Any help with this issue is much appreciated.
I am also trying to find a solution or, failing that, to write specific code to achieve this (help is appreciated in that sense as well).
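As a sketch of one approach that tends to scale for exact whole-word matching (an illustration only, and an assumption rather than a tool from this post: it uses GNU awk, which ships with the MSYS/GnuWin environments mentioned above), you can load the pattern list into an in-memory hash table once and then test each whitespace-separated word of the input against it in a single pass:
gawk '
    NR == FNR { pat[$0] = 1; next }      # first file: each pattern becomes a hash-table key
    {                                    # remaining input: print a line if any word is a known pattern
        for (i = 1; i <= NF; i++)
            if ($i in pat) { print; next }
    }
' patternlist.txt filetocheck.txt
For piped input, pass - as the second file (SOMEPROGRAM | gawk '...' patternlist.txt -). To drop matching lines instead, as in the password-filter example, print only the lines in which no word matched.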

Time complexity of finding duplicate files in bash

I had to write a Bash script to delete duplicate files today, using their md5 hashes. I stored those hashes as files in a temporary directory:
for i in * ; do
    hash=$(md5sum "$i" | cut -d " " -f1) ;
    if [ -f /tmp/hashes/$hash ] ;
    then
        echo "Deleted $i" ;
        mv "$i" /tmp/deleted ;
    else
        touch /tmp/hashes/$hash ;
    fi ;
done
It worked perfectly, but it led me to wonder: is this a time-efficient way of doing it? I initially thought of storing the MD5 hashes in a file, but then I thought "no, because checking whether a given MD5 is in this file requires re-reading it entirely every time". Now, I wonder: is it the same when using the "create files in a directory" method? Does the Bash [ -f ] check have linear, or quasi-constant, complexity when there are lots of files in the same directory?
If it depends on the filesystem, what's the complexity on tmpfs?
I'm a fan of using the right tool for the job. In this case, you only want to find duplicate files. I've tested this against several thousand files at my disposal, and re-reading the file did not seem to cause any problems. Plus, I noticed that I have hundreds of duplicate files. When I store hashes in separate files and then process this large quantity of files, my system slowly creeps along after about 10,000 hash files in one directory. Having all of the hashes in a single file greatly sped this up.
# This uses md5deep. An alternate is presented later.
md5deep -r some_folder > hashes.txt
# If you do not have md5deep
find . -type f -exec md5sum {} \; > hashes.txt
This gives you hashes of everything.
cut -b -32 hashes.txt | sort | uniq -d > dupe_hashes.txt
That will use cut to get the hash for each file, sort the hashes, then find any duplicated hashes. Those are written to dupe_hashes.txt without the filenames attached. Now we need to map hashes back to the files.
(for hash in $(cat dupe_hashes.txt); do
grep "^$hash" hashes.txt | tail -n +2 | cut -b 35-
done) > dupe_files.txt
This does not appear to run slowly for me. The Linux kernel does a very good job of keeping files like this in memory instead of reading them from the disk frequently. If you prefer to force this to be in memory, you could just use /dev/shm/hashes.txt instead of hashes.txt. I've found that it was unnecessary in my tests.
That gives you every file that's a duplicate. So far, so good. You'll probably want to review this list. If you want to list the original one as well, remove the tail -n +2 | bit from the command.
When you are comfortable that you can delete every listed file you can pipe things to xargs. This will delete the files in groups of 50.
xargs -L 50 rm < dupe_files.txt
I'll try to qualitatively answer how fast file existence tests are on tmpfs, and then I can suggest how you can make your whole program run faster.
First, tmpfs directory lookups rely (in the kernel) on directory entry cache hash table lookups, which aren't that sensitive to the number of files in your directory. They are affected, but sub-linearly. It has to do with the fact that properly-done hash table lookups take some constant time, O(1), regardless of the number of items in the hash table.
To explain, we can look at the work that is done by test -f, or [ -f X ], from coreutils (gitweb):
case 'e':
unary_advance ();
return stat (argv[pos - 1], &stat_buf) == 0;
...
case 'f': /* File is a file? */
unary_advance ();
/* Under POSIX, -f is true if the given file exists
and is a regular file. */
return (stat (argv[pos - 1], &stat_buf) == 0
&& S_ISREG (stat_buf.st_mode));
So it uses stat() on the filename directly. No directory listing is done explicitly by test, but the runtime of stat may be affected by the number of files in the directory. The completion time for the stat call will depend on the underlying filesystem implementation.
For every filesystem, stat will split the path into directory components and walk it down. For instance, for the path /tmp/hashes/the_md5: first /, gets its inode, then looks up tmp inside it, gets that inode (it's a new mountpoint), then gets the hashes inode, and finally the tested filename and its inode. You can expect the inodes all the way down to /tmp/hashes/ to be cached, because they are repeated at each iteration, so those lookups are fast and likely don't require disk access. Each lookup depends on the filesystem the parent directory is on. After the /tmp/ portion, lookups happen on tmpfs (which is all in memory, except if you ever run out of memory and need to use swap).
tmpfs in Linux relies on simple_lookup to obtain the inode of a file in a directory. tmpfs lives in the kernel tree under its old name, mm/shmem.c. tmpfs, much like ramfs, doesn't seem to implement data structures of its own to keep track of virtual data; it simply relies on the VFS directory entry cache (under Directory Entry Caches).
Therefore, I suspect the lookup for a file's inode, in a directory, is as simple as a hash table lookup. I'd say that as long as all your temporary files fit in your memory, and you use tmpfs/ramfs, it doesn't matter how many files are there -- it's O(1) lookup each time.
Other filesystems like Ext2/3, however, will incur a penalty linear with the number of files present in the directory.
storing them in memory
As others have suggested, you may also store the MD5s in memory, in bash variables, and avoid the filesystem (and associated syscall) penalties. Storing them on a filesystem has the advantage that you could resume from where you left off if you were to interrupt your loop (your md5 could be a symlink to the file whose digest matches, which you could rely on in subsequent runs), but it is slower.
MD5=d41d8cd98f00b204e9800998ecf8427e
let SEEN_${MD5}=1
...
digest=$(md5hash_of <filename>)
let exists=SEEN_$digest
if [[ "$exists" == 1 ]]; then
# already seen this file
fi
faster tests
And you may use [[ -f my_file ]] instead of [ -f my_file ]. [[ is a bash keyword and is a bit faster than [; note, though, that bash also provides [ as a built-in, so neither form actually spawns a new /usr/bin/[ process for each comparison, and the difference is smaller than it may sound.
what is /usr/bin/[
/usr/bin/test and /usr/bin/[ are two different programs, but the source code for [ (lbracket.c) is the same as test.c (again in coreutils):
#define LBRACKET 1
#include "test.c"
so they're interchangeable.
The choice between reading the contents of a file containing hashes and finding a hash in a directory of filenames that are the hashes basically comes down to "is the kernel quicker at reading a directory or your program at reading a file". Both are going to involve a linear search for each hash, so you end up with much the same behaviour. You can probably argue that the kernel should be a little quicker, but the margin won't be great. Note that most often, the linear search will be exhaustive because the hash won't exist (unless you have lots of duplicate files). So, if you're processing a few thousand files, the searches will process a few million entries overall — it is quadratic behaviour.
If you have many hundreds or thousands of files, you'd probably do better with a two-level hierarchy — for example, a directory containing two-character sub-directories 00 .. FF, and then storing the rest of the name (or the full name) in the sub-directory. A minor variation of this technique is used in the terminfo directories, for example. The advantage is that the kernel only has to read relatively small directories to find whether the file is present or not.
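A minimal sketch of that layout, adapted to the question's loop (the /tmp/hashes and /tmp/deleted paths come from the question; the two-character bucket directory is the addition):
hash=$(md5sum "$i" | cut -d " " -f1)
bucket=${hash:0:2}                          # the first two hex characters pick the sub-directory
mkdir -p "/tmp/hashes/$bucket"
if [ -f "/tmp/hashes/$bucket/$hash" ]; then
    mv "$i" /tmp/deleted
else
    touch "/tmp/hashes/$bucket/$hash"
fi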
I haven't "hashed" this out, but I'd try storing your md5sums in a bash hash.
See How to define hash tables in Bash?
Store the md5sum as the key and, if you want, the filename as the value. For each file, just see whether the key already exists in the hash table. If so, you don't care about the value, but you could use it to print out the original duplicate file's name. Then delete the current file (the one with the duplicate key). Not being a bash expert, that's where I'd start looking.
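A sketch of that idea (assuming bash 4 or later for associative arrays; the /tmp/deleted path is reused from the question):
declare -A seen                              # maps md5 -> first filename seen with that hash
for i in *; do
    hash=$(md5sum "$i" | cut -d " " -f1)
    if [[ -n "${seen[$hash]}" ]]; then
        echo "Deleted $i (duplicate of ${seen[$hash]})"
        mv "$i" /tmp/deleted
    else
        seen[$hash]=$i
    fi
done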

Opposite of Linux Split

I have a huge file that I split into several small chunks to divide and conquer. Now I have a folder containing a list of files like below:
output_aa #(the output file done: cat input_aa | python parse.py > output_aa)
output_ab
output_ac
output_ad
...
I am wondering whether there is a way to merge those files back together FOLLOWING THE INDEX ORDER:
I know I could do it by using
cat * > output.all
but I am more curious whether there is another magical command, already shipped with split, that does it.
The magic command would be:
cat output_* > output.all
There is no need to sort the file names as the shell already does it (*).
As its name suggests, cat's original design was precisely to conCATenate files, which is basically the opposite of split.
(*) Edit:
Should you use a (hypothetical?) locale whose collating order for a-z is not abcdefghijklmnopqrstuvwxyz, here is one way to overcome the issue:
LC_ALL=C sh -c 'cat output_* > output.all'
There are other ways to concatenate files together, but there is no magical "opposite of split" in "Linux".
Of course, talking about "Linux" in general is a bit far-fetched, as distributions ship different tools (many use a different shell by default, like sh, bash, csh, zsh, ksh, ...), but speaking at least of Debian-based Linux, I don't know of any distribution that provides such a tool.
For sorting you can use the Linux command sort.
Also be aware that using > to redirect stdout will overwrite any existing contents, while >> will append to an existing file.
I don't want to copycat, but to make this answer complete: what jlliagre said about the cat command should also be considered. cat was made to con-cat-enate files, which effectively makes it possible to reverse the split command, but only provided you use the same ordering of files. So it is not exactly the "opposite of split", yet it will work that way in close to 100% of cases (see the comments under jlliagre's answer for specifics).
