In Bash, how do you compare two files (not case sensitive) - bash

I figured out how to compare two files and use the exit status to tell whether they are the same. The problem is that it only works case-sensitively, because I based it on the exit status of the cmp command.
I suspect I'm supposed to use globbing (i.e. "[Aa][Bb][Cc][and so on...]"), but I don't know how to work that into the cmp command.

There is a utility for comparing two files in the shell:
diff -i file1 file2
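If all you need is the exit status (as in the question), you can combine -i with -q so diff only reports whether the files differ; a minimal sketch:
if diff -qi file1 file2 >/dev/null; then
  echo "files are the same (ignoring case)"
else
  echo "files differ"
fi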

Much faster than diff is to use cmp, after normalizing for case:
#!/bin/bash
# ^-- must not be /bin/sh, as process substitution is a bash/ksh/zsh feature
if cmp -s <(tr '[a-z]' '[A-Z]' <file1) <(tr '[a-z]' '[A-Z]' <file2); then
  echo "files are the same"
else
  echo "files differ"
fi
cmp -s is particularly fast, as it can exit as soon as it finds the first difference.
This is also much more memory-efficient -- it streams content through the tr operation (storing no more than one buffer's worth of each file at any given time) into cmp (which, likewise, needs to store only enough to buffer and compare). Compare this to a diff-type algorithm, which needs to be able to seek around in the files to find similar parts, and which thus has I/O or memory requirements well beyond the O(1) usage of cmp.
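The same idea also works with POSIX character classes, which behave more predictably than [a-z]/[A-Z] in non-C locales; a small sketch wrapping it in a function (the function name is just for illustration):
same_ignoring_case() {
  cmp -s <(tr '[:upper:]' '[:lower:]' <"$1") <(tr '[:upper:]' '[:lower:]' <"$2")
}
if same_ignoring_case file1 file2; then
  echo "files are the same"
else
  echo "files differ"
fi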

Related

Renaming multiple files by adding an integer value of 1

I have multiple svg files, and I need to rename them by adding 1 to the number in each name:
0.svg --> 1.svg
1.svg --> 2.svg
2.svg --> 3.svg
etc...
What would be the best way to do this using the linux terminal?
The trick is to process the files backwards so you don't overwrite existing files while renaming. Use parameter expansion to extract the numbers from the file names.
#!/bin/bash
files=(?.svg)
for (( i = ${#files[@]} - 1; i >= 0; --i )) ; do
  n=${files[i]%.svg}
  mv -- "$n.svg" "$(( n + 1 )).svg"
done
If the files can have names of different length (e.g. 9.svg, 10.svg) the solution will be more complex, as you need to sort the files numerically rather than lexicographically.
To handle the case where the filename numbers have multiple digits, try the following:
while IFS= read -r num; do
  new="$(( num + 1 )).svg"
  mv -- "$num.svg" "$new"
done < <(
  for f in *.svg; do
    n=${f%.svg}
    echo "$n"
  done | sort -rn
)
This Shellcheck-clean code is intended to operate safely and cleanly no matter what is in the current directory:
#! /bin/bash -p
shopt -s nullglob # Globs that match nothing expand to nothing
shopt -s extglob  # Enable extended globbing (+(...), ...)
# Put the file base numbers in a sparse array.
# (Bash automatically keeps such arrays sorted by increasing indices.)
sparse_basenums=()
for svgfile in +([0-9]).svg ; do
  # Skip files with extra leading zeros (e.g. '09.svg')
  [[ $svgfile == 0[0-9]*.svg ]] && continue
  basenum=${svgfile%.svg}
  sparse_basenums[$basenum]=$basenum
done
# Convert the sparse array to a non-sparse array (preserving order)
# so it can be processed in reverse order with a 'for' loop
basenums=( "${sparse_basenums[@]}" )
# Process the files in reverse (i.e. decreasing) order by base number
for ((i=${#basenums[*]}-1; i>=0; i--)) ; do
  basenum=${basenums[i]}
  mv -i -- "$basenum.svg" "$((basenum+1)).svg"
done
shopt -s nullglob prevents bad behaviour if the directory doesn't contain any files whose names are a decimal number followed by '.svg'. Without it the code would try to process a file called '+([0-9]).svg'.
shopt -s extglob enables a richer set of globbing patterns than the default. See the 'extglob' section in glob - Greg's Wiki for details.
The usefulness of sparse_basenums depends on the fact that Bash arrays can have arbitrary non-negative integer indices, that arrays with gaps in their indices are stored efficiently (sparse arrays), and that elements in arrays are always stored in order of increasing index. See Arrays (Bash Reference Manual) for more information.
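A quick illustration of that ordering property (a throwaway sketch, not part of the script above):
demo=()
demo[20]=twenty
demo[3]=three
demo[7]=seven
echo "${demo[@]}"   # prints "three seven twenty": increasing index order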
The code skips files whose names have extra leading zeros ('09.svg', but not '0.svg') because it can't handle them safely as it is now. Trying to treat '09' as a number causes an error because it's treated as an illegal octal number. That is easily fixable, but there could still be problems if, for instance, you had both '9.svg' and '09.svg' (they would both naturally be renamed to '10.svg').
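If you did want to accept such names, one possible approach (a sketch; the handling here is simplified, and the clash between '9.svg' and '09.svg' mentioned above would still need its own policy) is to force base-10 interpretation with the 10# prefix before doing arithmetic:
basenum=$(( 10#${svgfile%.svg} ))   # '09' -> 9, avoiding the octal error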
The code uses mv -i to prompt for user input in case something goes wrong and it tries to rename a file to one that already exists.
Note that the code will silently do the wrong thing (due to arithmetic overflow) if the numbers are too big (e.g. '99999999999999999999.svg'). The problem is fixable.

Bash split stdin by null and pipe to pipeline

I have a stream that is null delimited, with an unknown number of sections. For each delimited section I want to pipe it into another pipeline until the last section has been read, and then terminate.
In practice, each section is very large (~1GB), so I would like to do this without reading each section into memory.
For example, imagine I have the stream created by:
for I in {3..5}; do seq $I; echo -ne '\0'; done
I'll get a stream that looks like:
1
2
3
^@1
2
3
4
^@1
2
3
4
5
^@
When piped through cat -v.
I would like to pipe each section through paste -sd+ | bc, so I get a stream that looks like:
6
10
15
This is simply an example. In actuality the stream is much larger and the pipeline is more complicated, so solutions that don't rely on streams are not feasible.
I've tried something like:
set -eo pipefail
while head -zn1 | head -c-1 | ifne -n false | paste -sd+ | bc; do :; done
but I only get
6
10
If I leave off bc I get
1+2+3
1+2+3+4
1+2+3+4+5
which is basically correct. This leads me to believe that the issue is potentially related to buffering and the way each process is actually interacting with the pipes between them.
Is there some way to fix the way that these commands exchange streams so that I can get the desired output? Or, alternatively, is there a way to accomplish this with other means?
In principle this is related to this question, and I could certainly write a program that reads stdin into a buffer, looks for the null character, and pipes the output to a spawned subprocess, as the accepted answer does for that question. Given the general support of streams and null delimiters in bash, I'm hoping to do something a little more "native". In particular, if I go this route, I'll have to wrap the pipeline (paste -sd+ | bc) in a string instead of just letting the same shell interpret it. There's nothing too inherently bad about that, but it's a little ugly and will require a bunch of somewhat error-prone escaping.
Edit
As was pointed out in an answer, head makes no guarantees about how much it buffers. Unless it only buffered a single byte at a time, which would be impractical, this will never work. Thus, it seems like the only solution would be to read each section into memory, or to write a specific program.
The issue with your original code is that head doesn't guarantee that it won't read more than it outputs. Thus, it can consume more than one (NUL-delimited) chunk of input, even if it's emitting only one chunk of output.
read, by contrast, guarantees that it won't consume more than you ask it for.
set -o pipefail
while IFS= read -r -d '' line; do
  bc <<<"${line//$'\n'/+}"
done < <(build_a_stream)
If you want native logic, there's nothing more native than just writing the whole thing in shell.
Calling external tools -- including bc, cut, paste, or others -- involves a fork() penalty. If you're only processing small amounts of data per invocation, the efficiency of the tools is overwhelmed by the cost of starting them.
while read -r -d '' -a numbers; do    # read up to the next NUL into an array
  sum=0                               # initialize an accumulator
  for number in "${numbers[@]}"; do   # iterate over that array
    (( sum += number ))               # ...using an arithmetic context for our math
  done
  printf '%s\n' "$sum"
done < <(build_a_stream)
For all of the above, I tested with the following build_a_stream implementation:
build_a_stream() {
  local i j IFS=$'\n'
  local -a numbers
  for ((i=3; i<=5; i++)); do
    numbers=( )
    for ((j=0; j<=i; j++)); do
      numbers+=( "$j" )
    done
    printf '%s\0' "${numbers[*]}"
  done
}
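If you save build_a_stream together with either loop above in a single script (the filename below is just illustrative), running it should reproduce the desired output from the question:
bash sum_sections.sh
# 6
# 10
# 15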
As discussed, the only real solution seemed to be writing a program to do this specifically. I wrote one in Rust called xstream-util. After installing it with cargo install xstream-util, you can pipe the input into
xstream -0 -- bash -c 'paste -sd+ | bc'
to get the desired output
6
10
15
It doesn't avoid having to run the program in bash, so it still needs escaping if the pipeline is complicated. Also, it currently only supports single byte delimiters.

Disk space required for unix sort

I am currently doing a UNIX sort (via GitBash on a Windows machine) of a 500GB text file. Due to running out of space on the main disk, I have used the -T option to direct the temp files to a disk where I have enough space to accommodate the entire file. The thing is, I've been watching the disk space and apparently the temp files are already in excess of what the original file was. I don't know how much further this is going to go, but I'm wondering if there is a rule by which I can predict how much space I will need for temp files.
I'd batch it manually as described in this unix.SE answer.
Find some very basic queries that will divide your content into chunks that are small enough to be sorted. For example, if it's a file of words, you could create queries like grep ^a …, grep ^b …, and so on. Some items may need more granularity than others.
You can script that like:
#!/bin/bash
for char1 in other {0..9} {a..z}; do
  out="/tmp/sort.$char1.xz"
  echo "Extracting lines starting with '$char1'"
  if [ "$char1" = "other" ]; then char1='[^a-z0-9]'; fi
  grep -i "^$char1" *.txt | xz -c0 > "$out"
  unxz -c "$out" | sort -u >> output.txt || exit 1
  rm "$out"
done
echo "It worked"
I'm using xz -0 because it's almost as fast as gzip's default gzip -6 yet it's vastly better at conserving space. I omitted it from the final output in order to preserve the exit value of sort -u, but you could instead use a size check (iirc, sort fails with zero output) and then use sort -u |xz -c0 >> output.txt.xz since the xz (and gzip) container lets you concatenate archives (I've written about that before too).
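That variant might look like this inside the loop (a sketch; it assumes set -o pipefail near the top of the script so a failure of sort -u is not masked by xz):
unxz -c "$out" | sort -u | xz -c0 >> output.txt.xz || exit 1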
This works because the output of each grep run is already sorted (0 is before 1, which is before a, etc.), so the final assembly doesn't need to run through sort (note, the "other" section will be slightly different since some non-alphanumeric characters are before the numbers, others are between numbers and letters, and others still are after the letters. You can also remove grep's -i flag and additionally iterate through {A..Z} to be case sensitive). Each individual iteration obviously still needs to be sorted, but hopefully they're manageable.
If the program exits before completing all iterations and saying "It worked", you can edit the script to use a finer-grained batch for the last iteration it attempted. Remove all prior iterations, since they're successfully saved in output.txt.
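If you want to confirm afterwards that the assembled file really is in order (keeping in mind the caveats above about the "other" batch and case-insensitivity), sort can check a file without re-sorting it:
sort -c output.txt && echo "output.txt is sorted"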

generating every possible letter and number combination 8 and 63 characters long. in bash

How would I generate every possible letter and number combination into a word list, something kind of like "seq -w 0000000000-9999999999 > word-list.txt" or like "echo {a..z}{0..9}{Z..A}", but I need to include letters and length options. Any help? As side info, this will be run on a GTX 980 so it won't be too slow, but I am worried about the storage issue. If you have any solutions please let me know.
file1.sh:
#!/bin/bash
echo \#!/bin/bash > file2.sh
for i in $(seq $1 $2)
do
  echo -n 'echo ' >> file2.sh
  for x in $(seq 1 $i)
  do
    echo -n "{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,\
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,\
0,1,2,3,4,5,6,7,8,9}" >> file2.sh
  done
  echo >> file2.sh
done
I don't think it's a good idea, but this ought to generate a file which when executed will generate all possible alphanumeric sequences with lengths between the first and second arguments inclusive. Please don't actually try it for 8 through 64, I'm not sure what will happen but there's no way it will work. Sequences of the same length will be space separated, sequences of different lengths will be separated by newlines. You can send it through tr if you want to replace the spaces with newlines. I haven't tested it and I clearly didn't bother with silly things like input validation, but it seems about right.
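For a very small range, though, a quick sanity check might look like this (illustrative only):
bash file1.sh 1 2                    # writes file2.sh with one echo line per length
bash file2.sh | tr ' ' '\n' | head   # first few of the generated sequences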
If you're doing brute-force testing, there should be no need to save every combination somewhere, given how easy and (comparatively) quick it is to generate on the fly. You might also consider looking at one of the many large password files that are available on the internet.
P.S. You should try doing the math to figure out approximately how much you would have to spend on 1TB hard drives to store the file you wanted.
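To make that concrete, a rough back-of-the-envelope calculation for length 8 alone (62 characters per position, about 9 bytes per line including the newline):
echo $(( 62 ** 8 ))       # 218340105584896 eight-character combinations
echo $(( 62 ** 8 * 9 ))   # ~1.97e15 bytes, i.e. roughly 2 petabytes, for length 8 alone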

how to reuse stdout output without saving it to physical disk file

I have a for-loop, like the following:
for inf in $filelist; do
  for ((i=0; i<imax; ++i)); do
    temp=`<command_1> $inf | <command_2>`
    eval set -A array -- $temp
    ...
  done
  ...
done
Problem is, command_1 a bit time consuming and its output is a bit large (900MB is the highest, depending on how big the input file is). So, I modified the script to:
outf="./temp"
for inf from $filelist; do
<command_1> $inf -o $outf
for ((i=0; i<imax; ++i)); do
temp=`cat $outf | <command_2>`
eval set -A array -- $temp
...
done
...
done
There is a little performance improvement, but not as much as I want, probably because disk I/O is a performance bottleneck as well.
Just curious if there is a way to save the stdout output of command_1, so that I could reuse it without saving it to a physical disk file?
don't use pipelines inside nested loops
Based on new comments and another look at the original question, I would strongly recommend against using a pipeline processing large amounts of data inside a nested loop. Shell pipelines are far from efficient, and incur lots of process overhead.
Look at the original problem: examine what command_1 and command_2 each contribute, and see if you could solve this in another way.
That said: here's the original answer:
In the shell there are two ways of storing data: either in a shell variable, or in a file. You might try storing that file on a memory-based filesystem, like /dev/shm on Linux or tmpfs on Solaris.
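A sketch of both ideas, using the question's placeholder command names (command_1 and command_2 stand in for the real programs):
# Option 1: keep command_1's output in a shell variable (fine if it fits in RAM;
# note that command substitution strips trailing newlines and cannot hold NUL bytes)
data=$(command_1 "$inf")
for ((i=0; i<imax; ++i)); do
  temp=$(printf '%s\n' "$data" | command_2)
  # ...
done
# Option 2: write it once per input file to a memory-backed filesystem
outf="/dev/shm/temp.$$"
command_1 "$inf" -o "$outf"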
You might also analyse command_1 and command_2 for optimisations. Is there anything in the output of command_1 that's not needed by command_2? Try to put a filter between the two.
Example:
command_1 | awk '{ print $2 }' | command_2
(Assuming command_2 only needs column 2 of command_1's output.)
