Disk space required for UNIX sort

I am currently doing a UNIX sort (via GitBash on a Windows machine) of a 500GB text file. Due to running out of space on the main disk, I have used the -T option to direct the temp files to a disk where I have enough space to accommodate the entire file. The thing is, I've been watching the disk space and apparently the temp files are already in excess of what the original file was. I don't know how much further this is going to go, but I'm wondering if there is a rule by which I can predict how much space I will need for temp files.

I'd batch it manually as described in this unix.SE answer.
Find some very basic queries that will divide your content into chunks that are small enough to be sorted. For example, if it's a file of words, you could create queries like grep ^a …, grep ^b …, and so on. Some items may need more granularity than others.
You can script that like:
#!/bin/bash
for char1 in other {0..9} {a..z}; do
    out="/tmp/sort.$char1.xz"
    echo "Extracting lines starting with '$char1'"
    if [ "$char1" = "other" ]; then char1='[^a-z0-9]'; fi
    grep -i "^$char1" *.txt | xz -c0 > "$out"
    unxz -c "$out" | sort -u >> output.txt || exit 1
    rm "$out"
done
echo "It worked"
I'm using xz -0 because it's almost as fast as gzip's default level (gzip -6) yet vastly better at conserving space. I omitted compression from the final output in order to preserve the exit value of sort -u, but you could instead use a size check (if I recall correctly, sort produces no output when it fails) and then use sort -u |xz -c0 >> output.txt.xz, since the xz (and gzip) container formats let you concatenate archives (I've written about that before too). A sketch of that variant appears at the end of this answer.
This works because the chunks are produced in sorted order relative to each other (0 is before 1, which is before a, etc.), so the final assembly doesn't need another pass through sort. Note that the "other" section is a slight exception, since some non-alphanumeric characters sort before the numbers, others between the numbers and the letters, and others still after the letters. You can also remove grep's -i flag and additionally iterate through {A..Z} to make the result case sensitive. Each individual iteration obviously still needs to be sorted, but hopefully those chunks are manageable.
If the script exits before completing all iterations and printing "It worked", you can edit it to use finer-grained batches for the last iteration it attempted. Remove all prior iterations, since their output has already been saved in output.txt.
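For the compressed-output variant mentioned above, the tail of the loop body might look like the sketch below. This is only a sketch: I've swapped the size check for bash's pipefail option (which makes the pipeline report sort's failure even though xz runs last), and $out.sorted.xz is just an illustrative temporary name.
set -o pipefail
# Sort each chunk, keep the result compressed, and append it to one xz stream.
unxz -c "$out" | sort -u | xz -c0 > "$out.sorted.xz" || exit 1
cat "$out.sorted.xz" >> output.txt.xz   # concatenated xz streams form a valid archive
rm "$out" "$out.sorted.xz"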

Related

Append to list of files in bash

So I'm trying to get a simple bash script to continuously read a directory and update a list of files to play through a command. However, I'm having some trouble working out the logic. What I need to do is put the current items in the directory into the list, run each item in the directory through a program, and, when a new item comes in, just append it to the list. I'm attempting to use inotifywait but can't seem to work out the proper logic. I may need it to run in the background, because the process that runs on these files finishes before inotifywait is read again, at which point it won't pick up any new files that have been added, since it only checks when it runs. Here's the code, so hopefully it makes more sense.
#!/bin/bash
# Initial check to see if files are converted.
if [ ! -d "/home/pi/rpitx/converted" ]; then
    echo "Converted directory does not exist, cannot play!"
    exit 1
fi
CYAN='\e[36m'
NC='\e[39m'
LGREEN='\e[92m'
# Iterate through the directory first and act upon each item
for f in $FILES
do
    echo -e "${CYAN}Now playing ${f##*/}...${NC}"
    # Figure out a way to always watch the directory, even while playing
    inotifywait -m /home/pi/rpitx/converted -e create -e moved_to |
        while read path action file; do
            echo -e "${LGREEN}New file found: ${CYAN}${file}${NC}"
            FILES+=($file)
        done
    # Take action on each file. $f stores the current file name
    sudo ./rpitx -m RF -i "${f}" -f 101100
done
exit 0
So, for example, if rpitx is currently playing something and a file is converted, the script won't pick up the latest file and add it to the list, nor will it ever get past the watch to play it, since inotifywait is always reading. Is there a way to get inotifywait to run in the background of this script somehow? Thanks.
This is actually quite a difficult problem to get 100% perfect, but it is possible to get pretty close.
It is easy to get all the files in a directory, and it is easy to use inotifywait to get iteratively informed of new files being placed into the directory. The issue is getting the two to be consistent. If inotifywait isn't started until all the files have been processed (or even just listed), then you might miss new files created between the listing and the invocation of inotifywait. If, on the other hand, you start inotifywait first, then a file created after the invocation of inotifywait and the extraction of the current file list will be listed twice.
Since it is easier to filter duplicates than notice orphans, the recommended approach is the second one.
As a first approximation, we could ignore the duplicate problem on the assumption that the window of vulnerability is pretty short and so it is probably unlikely to happen. This simplifies the code, but it's not that difficult to track and eliminate duplicates: we could, for example, store each filename as the key in an associative array, ignoring the file if the key already exists.
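A minimal sketch of that duplicate filter, assuming bash 4+ for associative arrays and using the helper functions developed below, would make the final stage of the pipeline a block that remembers which filenames it has already seen:
list_new_files |
{ list_existing_files; pass_through; } |
{
    declare -A seen                            # filenames already handled
    while read -r action file; do
        [[ -n "${seen[$file]}" ]] && continue  # duplicate from the overlap window; skip it
        seen[$file]=1
        handle "$action" "$file"
    done
}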
We need three processes: one to execute inotifywait; one to produce the list of initial files; and one to handle each file as it is identified. So the basic structure of the code will be:
list_new_files |
{ list_existing_files; pass_through; } |
while read action file; do
    handle "$action" "$file"
done
Note that the second process first produces the existing files, and then calls pass_through, which reads from standard input and writes to standard output, thus passing through the files being discovered by list_new_files. Since pipes have a finite capacity, it is possible that the execution of list_existing_files will block a few times (if there are lots of existing files and handling them takes a long time), so when pass_through finally gets executed, it could have quite a bit of queued-up input to pass through. That doesn't matter, unless the first pipe also fills up, which will happen if a large number of new files are created. And that still won't matter as long as inotifywait doesn't lose notifications while it is blocked on a write. (This may actually be a problem, since the manpage for inotifywait on my system includes in the "BUGS" section the note, "It is assumed the inotify event queue will never overflow." We could fix the problem by inserting another process which carefully buffers inotifywait's output, but that shouldn't be necessary unless you intend to flood the directory with lots of files.)
Now, let's examine each of the functions in turn.
list_new_files could be just the call to inotifywait from your original script:
inotifywait -m /home/pi/rpitx/converted -e create -e moved_to
Listing existing files is also easy. Here's one simple solution:
printf "%s\n" /home/pi/rpitx/converted/*
However, that will print out the full file path, which is different from the output from inotifywait. To make them the same, we cd into the directory in order to do the listing. Since we might not actually want to change the working directory, we use a subshell by surrounding the commands inside parentheses:
( cd /home/pi/rpitx/converted; printf "%s\n" *; )
The printf just prints its arguments each on a separate line. Since glob-expansions are not word-split or recursively glob-expanded, this is safe against whitespace or metacharacters in filenames, except newline characters. Filenames with newline characters are pretty rare; for now, I'll ignore the issue but I'll indicate how to handle it at the end.
Even with the change indicated above, the output from these two commands is not compatible: the first one outputs three things on each line (directory, action, filename), and the second one just one thing (the filename). In the listing below, you'll see how we modify the format to printf and introduce a format for inotifywait in order to make the outputs fully compatible, with the "action" for existing files set to EXISTING.
pass_through could, in theory, just be cat, and that's how I've coded it below. However, it is important that it operate in line-buffered mode; otherwise, nothing will happen until "enough" files have been written by list_existing_files. On my system, cat in this configuration works perfectly; if that doesn't work for you or you don't want to count on it, you could write it explicitly as a while read loop:
pass_through() {
    while read -r line; do echo "$line"; done
}
Finally, handle is essentially the code from the original post, but modified a bit to take the new format into account, and to do the right thing with action EXISTING.
# Colours. Note the use of $'...' to actually store the escape codes,
# thereby avoiding the need to later reinterpret backslash sequences
CYAN=$'\e[36m'
NC=$'\e[39m'
LGREEN=$'\e[92m'

converted=/home/pi/rpitx/converted

list_new_files() {
    inotifywait -m "$converted" -e create -e moved_to --format "%e %f"
}

# Note the use of ( ) around the body instead of { }.
# This is the same as { ( ... ); }; it makes the `cd` local to the function.
list_existing_files() (
    cd "$converted"
    printf "EXISTING %s\n" *
)

# Invoked as `handle action filename`
handle() {
    case "$1" in
        EXISTING)
            echo "${CYAN}Now playing ${2}...${NC}"
            ;;
        *)
            echo "${LGREEN}New file found: ${CYAN}${2}${NC}"
            ;;
    esac
    # Play the file; both sources report names relative to $converted.
    sudo ./rpitx -m RF -i "$converted/$2" -f 101100
}

# Put everything together
list_new_files |
{ list_existing_files; cat; } |
while read -r action file; do handle "$action" "$file"; done
What if we thought a filename might have a newline character in it? There are two "safe" characters which could be used to delimit the filenames, in the sense that they cannot appear inside a filename. One is /, which can obviously appear in a path, but cannot appear in a simple filename, which is what we're working with here. The other one is the NUL character, which cannot appear inside a filename at all, but can sometimes be a bit annoying to deal with.
Normally, faced with this problem, we would use a NUL, but that depends on the various utilities we're using allowing the separation of data with NUL instead of newline. That's not the case for inotifywait, which always outputs a newline after a notification line. So in this case it seems simpler to use a /. First we modify the formats:
inotifywait -m "$converted" -e create -e moved_to --format "%e %f/"
printf "EXISTING %s/\n" *
Now, when we're reading the lines, we need to read until we find a line ending with / (and remember to remove it). read doesn't allow two-character line terminators, so we need to accumulate the lines ourselves:
while read -r action file; do
    # If file doesn't end with a slash, we need to read another line
    while [[ $file != */ ]] && read -r line; do
        file+=$'\n'"$line"
    done
    # Remember to remove the trailing slash
    handle "$action" "${file%/}"
done

Bash remove half of the files in the directory

I am trying to remove half of the files in the corpora directory, to make training my spam filter a little faster and, in the future, to save some space. Normally I would do this by trial and error, but since these files took a while to download, and since this is shell (which I am obviously not an expert in), I do not want to mess it up.
I would try something like this:
ls *.* > list
for i in 'cat list'; do rm -f i++; done
But I'm pretty sure i++ like this isn't a proper way to skip every second item in the list. Perhaps I should use some other loop?
Secondly, there are two types of files in that directory:
0000.* to 1500.*
0000.* to 0250.*
I want to delete half of the first type and half of the second type. They are probably sorted the standard way in the list, meaning that from 0000.* to 0250.* the two types interweave and after 0250.* only the first type remains, so deleting every second entry could go the wrong way (all of the second type could end up deleted).
So IMHO, I should do it like this:
Both types delete 0000.*
Both types skip 0001.*
Both types delete 0002.*
etc.
Do you guys have an idea how to delete these files like above?
If you just want to delete every second file, then you can use a simple alternating state machine. Since *.* will give you the files in sorted order, you can just delete every second file, with something like:
del=1
for fspec in *.* ; do
    if [[ ${del} -eq 1 ]] ; then
        del=0
        echo rm ${fspec}
    else
        echo ok ${fspec}
        del=1
    fi
done
If you run that script you'll get a series of alternating lines saying:
rm file1
ok file2
rm file3
ok file4
and so on.
Once you're happy with the behaviour, you can comment out the ok line entirely and remove the echo from the rm line.
However, if your intent is to actually delete all files of the form NNNN.*, where NNNN is in the set {0000, 0002, 0004, ..., 9998}, that can be done more concisely (again, remove the echo when you're happy):
for id in {0000..9998..2} ; do
    echo rm -f ${id}.*
done
The 0000 ensures the generated strings are four digits long (zero-padded brace expansion), assuming you have a recent enough bash. If yours doesn't support that, you can instead use:
for id in {0..9998..2} ; do
    echo rm -f $(printf "%04d" ${id}).*
done
Regardless of the method you choose, I'd be making a backup of the directory you're working in before testing as well.
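One way to do that (the backup names here are just examples):
# Keep an untouched copy of the corpus directory before deleting anything.
cp -a corpora corpora.backup
# or, if space is a concern, keep a compressed archive instead:
tar -czf corpora-backup.tar.gz corpora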

generating every possible letter and number combination between 8 and 63 characters long in bash

How would I generate every possible letter and number combination into a word list, something kind of like "seq -w 0000000000-9999999999 > word-list.txt" or like "echo {a..z}{0..9}{Z..A}", but I need to include letters and length options. Any help? As a side note, this will be run on a GTX 980 so it won't be too slow, but I am worried about the storage issue. If you have any solutions please let me know.
file1.sh:
#!/bin/bash
echo '#!/bin/bash' > file2.sh
for i in $(seq $1 $2)
do
    echo -n 'echo ' >> file2.sh
    for x in $(seq 1 $i)
    do
        # Braces are escaped so the brace expression is written literally into
        # file2.sh and only expands when file2.sh itself is executed.
        echo -n \{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,\
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,\
0,1,2,3,4,5,6,7,8,9\} >> file2.sh
    done
    echo >> file2.sh
done
I don't think it's a good idea, but this ought to generate a file which when executed will generate all possible alphanumeric sequences with lengths between the first and second arguments inclusive. Please don't actually try it for 8 through 64, I'm not sure what will happen but there's no way it will work. Sequences of the same length will be space separated, sequences of different lengths will be separated by newlines. You can send it through tr if you want to replace the spaces with newlines. I haven't tested it and I clearly didn't bother with silly things like input validation, but it seems about right.
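As a small sanity check (my own example invocation, sticking to short lengths):
# Generate file2.sh for lengths 1 through 2, run it, and print one sequence per line.
bash file1.sh 1 2
bash file2.sh | tr ' ' '\n' | head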
If you're doing brute-force testing, there should be no need to save every combination anywhere, given how easy and (comparatively) quick it is to generate them on the fly. You might also consider looking at one of the many large password lists that are available on the internet.
P.S. You should try doing the math to figure out approximately how much you would have to spend on 1TB hard drives to store the file you wanted.
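To give a rough sense of scale (my own back-of-the-envelope figure, not part of the original answer), even the shortest requested length is already far beyond single hard drives:
# Length-8 tier alone: 62^8 strings at roughly 9 bytes each (8 characters + newline).
awk 'BEGIN { printf "%.2e bytes (~%.1f PB)\n", 62^8 * 9, 62^8 * 9 / 1e15 }'
# Prints about 1.97e+15 bytes, i.e. roughly 2 PB -- and every extra character
# multiplies that by another 62.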

In Bash, how do you compare two files (not case sensitive)

I figured out how to compare two files and use the status code of that comparison to see whether the files are the same or not. The problem is that it only does a case-sensitive comparison, and I need it to ignore case. I used the status code of the cmp command.
I suspect I'm supposed to use globbing (i.e. "[Aa][Bb][Cc]" and so on), but I don't know how to work that into the cmp command.
There is a utility for comparing two files in the shell:
diff -i file1 file2
Much faster than diff is to use cmp, after normalizing for case:
#!/bin/bash
# ^-- must not be /bin/sh, as process substitution is a bash/ksh/zsh feature
if cmp -s <(tr '[a-z]' '[A-Z]' <file1) <(tr '[a-z]' '[A-Z]' <file2); then
    echo "files are the same"
else
    echo "files differ"
fi
cmp -s is particularly fast, as it can exit as soon as it finds the first difference.
This is also much more memory-efficient -- it streams content through the tr operation (storing no more than one buffer's worth of each file at any given time), into cmp (which, likewise, needs to only store enough to immediately buffer and compare). Compare to a diff-type algorithm, which needs to be able to seek around in files to find similar parts, and which thus has IO or memory requirements well beyond the O(1) usage of cmp.
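If you want to reuse that comparison (and its exit status) elsewhere, one way is to package it as a small function; the function name and the sample filenames below are just placeholders:
#!/bin/bash
# Succeeds (exit status 0) when the two files match, ignoring case.
files_equal_nocase() {
    cmp -s <(tr '[a-z]' '[A-Z]' < "$1") <(tr '[a-z]' '[A-Z]' < "$2")
}

if files_equal_nocase report1.txt report2.txt; then
    echo "files are the same"
else
    echo "files differ"
fi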

how to reuse stdout output without saving it to a physical disk file

I have a for-loop, like the following:
for inf in $filelist; do
    for ((i=0; i<imax; ++i)); do
        temp=`<command_1> $inf | <command_2>`
        eval set -A array -- $temp
        ...
    done
    ...
done
The problem is, command_1 is a bit time-consuming and its output is fairly large (900 MB at the highest, depending on how big the input file is). So I modified the script to:
outf="./temp"
for inf in $filelist; do
    <command_1> $inf -o $outf
    for ((i=0; i<imax; ++i)); do
        temp=`cat $outf | <command_2>`
        eval set -A array -- $temp
        ...
    done
    ...
done
There is a little performance improvement, but not as much as I want, probably because disk I/O is a performance bottleneck as well.
Just curious if there is a way to save the stdout output of command_1, so that I could reuse it without saving it to a physical disk file?
don't use pipelines inside nested loops
Based on new comments and another look at the original question, I would strongly recommend against using a pipeline processing large amounts of data inside a nested loop. Shell pipelines are far from efficient, and incur lots of process overhead.
Going back to the original problem: look at what command_1 and command_2 each contribute, and see if you could solve this in another way.
That said: here's the original answer:
In the shell there are two ways of storing data: either in a shell variable or in a file. You might try storing that file on a memory-based filesystem, like /dev/shm on Linux or tmpfs on Solaris.
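As a rough sketch of both options, reusing the command_1/command_2 placeholders from the question (the paths and names are illustrative):
# Option 1: hold command_1's output in a shell variable (kept in memory).
# Caveats: shell variables cannot contain NUL bytes, and expanding a ~900 MB
# variable repeatedly is itself not free.
data=$(command_1 "$inf")
temp=$(printf '%s\n' "$data" | command_2)

# Option 2: keep the temporary file, but put it on a RAM-backed filesystem.
outf=/dev/shm/temp.$$        # /dev/shm is tmpfs (memory-backed) on most Linux systems
command_1 "$inf" -o "$outf"
temp=$(command_2 < "$outf")
rm -f "$outf"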
You might also analyse command_1 and command_2 for optimisations. Is there anything in the output of command_1 that's not needed by command_2? Try to put a filter between the two.
Example:
command_1 | awk '{ print $2 }' | command_2
(Assuming command_2 only needs column 2 of command_1's output.)
