How to compare multiple extension-less files in Bash

How to compare multiple extension-less files in Bash - bash

I'm new to bash shell scripting.
How can I compare 8 outputs of extension-less files (with only binary values) - same length of values, 0 or 1.
To clarify things, this is what I've done so far.
for d in */; do
find . -name base -execdir sh -c 'cat {} >> out' \;
done
I've Found all the files that are located in sub-folders, read & concatenated all the binary files into out file.
Now I have 8 out files (8 parent folders) that I need to compare with.
I've tried both "diff" and "cmp" - but they both work only with 2 files.
At the end, I need to check and verify if there is a difference between this 8 binary files and eventually to export the results and represent them in HEX format - example: if 2 of the out files are all '1' = F , and if all '0' = 0 . hence, the final results should be for example : FFFF 0000 (4 first files are all '1' , 4 last files are all '0').
What is the best option to do so? - Hope that I've managed to clarify my case.
Thanks a lot for the help.

Let me assume:
We have 8 (presumably binary) files, say: dir1/out.txt, dir2/out.txt, ..
dir8/out.txt.
We want to compare among these files and identify which files are identical
and which are not.
Then how about the steps:
To generate hash values of the files with e.g. sha256sum.
To compare the hash values and divide into groups based on the hash values.
I have created 8 test files, of those dir1/out.txt, dir2/out.txt and dir4/out.txt
are indentical, dir3/out.txt and dir7/out.txt are identical, and others
differ.
Then the hash values will look like:
sha256sum dir*/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir1/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir2/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir3/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir4/out.txt
f45151f5253c62de69c95935f083b5649876fdb661412d4f32065a7b018bf68b dir5/out.txt
bdc26931acfb734b142a8d675f205becf27560dc461f501822de13274fe6fc8a dir6/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir7/out.txt
11a77c3d96c06974b53d7f40a577e6813739eb5c811b2a86f59038ea90add772 dir8/out.txt
To summarize the result, let me replace the hash values with group id, having
the same number for the same files in occurance order.
Here's the script:
sha256sum dir*/out.txt | awk '{if (!gid[$1]) gid[$1] = ++n; print $2 " " gid[$1]}'
The output:
dir1/out.txt 1
dir2/out.txt 1
dir3/out.txt 2
dir4/out.txt 1
dir5/out.txt 3
dir6/out.txt 4
dir7/out.txt 2
dir8/out.txt 5
where the second field shows the group id to indicate which files are identical.
Note that the group id does not represent the content of each file as:
if 2 of the out.txt files are all '1' = F , and if all '0' = 0,
because I have no idea how the files look like. If OP can provide the
example files, I could be more help.
BTW I'm still in doubt if the files are binary in ordinary sense because
OP is mentioning that "it's simply a file that contains 0 or 1 in its
value when I open it". It sounds to me the files are composed of
ascii "0"s and "1"s. My script above should work for both binary files
and text files anyway.
[Update]
According to the OP's information, here's a solution for the specific case:
#!/bin/bash
for f in dir*/out.txt; do
if [[ $(uniq "$f" | wc -l) = 1 ]]; then
echo -n "$(head -1 "$f" | tr 1 F)"
else
echo -n "-"
fi
done
echo
It digests the contents of each file to either of: 0 for all 0's, F for all 1's or - for the mixture case (possible error).
For instance, if dir{1..4}/out.txt are all 0's, dir5/out.txt is a mixture, and dir{6..8}/out.txt are all 1's, then the output will look like:
0000-FFF
I hope it will meet the OP's requirements.

If you are looking for records that are unique in your list of files
cat $path/$files|uniq -u>/tmp/output.txt
grep -f /tmp/output.txt $path/$files

Related

Bash: checking substring increments with modular arithmetic

I have a list of files with file names that contain a substring of 6 numbers that represents HHMMSS, HH: 2 digits hour, MM: 2 digits minutes, SS: 2 digits seconds.
If the list of files is ordered, the increments should be in steps of 30 minutes, that is, the first substring should be 000000, followed by 003000, 010000, 013000, ..., 233000.
I want to check that no file is missing iterating the list of files and checking that neither of these substrings is missing. My approach:
string_check=000000
for file in ${file_list[#]}; do
if [[ ${file:22:6} == $string_check ]]; then
echo "Ok"
else
echo "Problem: an hour (file) is missing"
exit 99
fi
string_check=$((string_check+3000)) #this is the key line
done
And the previous to the last line is the key. It should be formatted to 6 digits, I know how to do that, but I want to add time like a clock, or, in more specific words, modular arithmetic modulo 60. How can that be done?

Assumptions:
all 6-digit strings are of the format xx[03]0000 (ie, has to be an even 00 or 30 minutes and no seconds)
if there are strings like xx1529 ... these will be ignored (see 2nd half of answer - use of comm - to address OP's comment about these types of strings being an error)
Instead of trying to do a bunch of mod 60 math for the MM (minutes) portion of the string, we can use a sequence generator to generate all the desired strings:
$ for string_check in {00..23}{00,30}00; do echo $string_check; done
000000
003000
010000
013000
... snip ...
230000
233000
While OP should be able to add this to the current code, I'm thinking we might go one step further and look at pre-parsing all of the filenames, pulling the 6-digit strings into an associative array (ie, the 6-digit strings act as the indexes), eg:
unset myarray
declare -A myarray
for file in ${file_list}
do
myarray[${file:22:6}]+=" ${file}" # in case multiple files have same 6-digit string
done
Using the sequence generator as the driver of our logic, we can pull this together like such:
for string_check in {00..23}{00,30}00
do
[[ -z "${myarray[${string_check}]}" ]] &&
echo "Problem: (file) '${string_check}' is missing"
done
NOTE: OP can decide if the process should finish checking all strings or if it should exit on the first missing string (per OP's current code).
One idea for using comm to compare the 2 lists of strings:
# display sequence generated strings that do not exist in the array:
comm -23 <(printf "%s\n" {00..23}{00,30}00) <(printf "%s\n" "${!myarray[#]}" | sort)
# OP has commented that strings not like 'xx[03]000]` should generate an error;
# display strings (extracted from file names) that do not exist in the sequence
comm -13 <(printf "%s\n" {00..23}{00,30}00) <(printf "%s\n" "${!myarray[#]}" | sort)
Where:
comm -23 - display only the lines from the first 'file' that do not exist in the second 'file' (ie, missing sequences of the format xx[03]000)
comm -13 - display only the lines from the second 'file' that do not exist in the first 'file' (ie, filenames with strings not of the format xx[03]000)
These lists could then be used as input to a loop, or passed to xargs, for additional processing as needed; keeping in mind the comm -13 output will display the indices of the array, while the associated contents of the array will contain the name of the original file(s) from which the 6-digit string was derived.

Doing this easy with POSIX shell and only using built-ins:
#!/usr/bin/env sh
# Print an x for each glob matched file, and store result in string_check
string_check=$(printf '%.0sx' ./*[0-2][0-9][03]000*)
# Now string_check length reflects the number of matches
if [ ${#string_check} -eq 48 ]; then
echo "Ok"
else
echo "Problem: an hour (file) is missing"
exit 99
fi
Alternatively:
#!/usr/bin/env sh
if [ "$(printf '%.0sx' ./*[0-2][0-9][03]000*)" \
= 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' ]; then
echo "Ok"
else
echo "Problem: an hour (file) is missing"
exit 99
fi

How can I find the missing integers in a unique and sequential list (one per line) in a unix terminal?

Suppose I have a file as follows (a sorted, unique list of integers, one per line):
1
3
4
5
8
9
10
I would like the following output (i.e. the missing integers in the list):
2
6
7
How can I accomplish this within a bash terminal (using awk or a similar solution, preferably a one-liner)?

Using awk you can do this:
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}' file
2
6
7
Explanation:
{p = $1}: Variable p contains value from previous record
{for ...}: We loop from p+1 to the current row's value (excluding current value) and print each value which is basically the missing values

Using seq and grep:
seq $(head -n1 file) $(tail -n1 file) | grep -vwFf file -
seq creates the full sequence, grep removes the lines that exists in the file from it.

perl -nE 'say for $a+1 .. $_-1; $a=$_'

Calling no external program (if filein contains the list of numbers):
#!/bin/bash
i=0
while read num; do
while (( ++i<num )); do
echo $i
done
done <filein

To adapt choroba's clever answer for my own use case, I needed my sequence to deal with zero-padded numbers.
The -w switch to seq is the magic here - it automatically pads the first number with the necessary number of zeroes to keep it aligned with the second number:
-w, --equal-width equalize width by padding with leading zeroes
My integers go from 0 to 9999, so I used the following:
seq -w 0 9999 | grep -vwFf "file.txt"
...which finds the missing integers in a sequence from 0000 to 9999. Or to put it back into the more universal solution in choroba's answer:
seq -w $(head -n1 "file.txt") $(tail -n1 "file.txt") | grep -vwFf "file.txt"
I didn't personally find the - in his answer was necessary, but there may be usecases which make it so.

Using Raku (formerly known as Perl_6)
raku -e 'my #a = lines.map: *.Int; say #a.Set (^) #a.minmax.Set;'
Sample Input:
1
3
4
5
8
9
10
Sample Output:
Set(2 6 7)
I'm sure there's a Raku solution similar to #JJoao's clever Perl5 answer, but in thinking about this problem my mind naturally turned to Set operations.
The code above reads lines into the #a array, mapping each line so that elements in the #a array are Ints, not strings. In the second statement, #a.Set converts the array to a Set on the left-hand side of the (^) operator. Also in the second statement, #a.minmax.Set converts the array to a second Set, on the right-hand side of the (^) operator, but this time because the minmax operator is used, all Int elements from the min to max are included. Finally, the (^) symbol is the symmetric set-difference (infix) operator, which finds the difference.
To get an unordered whitespace-separated list of missing integers, replace the above say with put. To get a sequentially-ordered list of missing integers, add the explicit sort below:
~$ raku -e 'my #a = lines.map: *.Int; .put for (#a.Set (^) #a.minmax.Set).sort.map: *.key;' file
2
6
7
The advantage of all Raku code above is that finding "missing integers" doesn't require a "sequential list" as input, nor is the input required to be unique. So hopefully this code will be useful for a wide variety of problems in addition to the explicit problem stated in the Question.
OTOH, Raku is a Perl-family language, so TMTOWTDI. Below, a #a.minmax array is created, and grepped so that none of the elements of #a are returned (none junction):
~$ raku -e 'my #a = lines.map: *.Int; .put for #a.minmax.grep: none #a;' file
2
6
7
https://docs.raku.org/language/setbagmix
https://docs.raku.org/type/Junction
https://raku.org

Selecting a random read from pair of reads in fastq file

I have a question about random selection of a read from a sampled pair-end fastq files. I read some topics regarding this manner but none could solve my problem, which is:
I got two fastq files R1.fastq and R2.fastq. What I want to achieve is to randomly sample those files and from each sampled pair of reads I want to randomly select only one read.
What I did so far is...
I sampled my files using seqtk:
seqtk sample -s100 R1.fastq 10000 > R1_sample.fastq
seqtk sample -s100 R2.fastq 10000 > R2_sample.fastq
then I sorted each file by sequence ID like this:
paste - - - - < R1_sample.fastq | sort -k1 -t " " | tr "\t" "\n" > R1_sample_sorted.fastq
I did the same with R2_sample.fastq. Then I merged both sorted files so that R1 would be in one column and R2 in the second column:
pr -mts R1_sample_sorted.fastq R2_sample_sorted.fastq > merged.fastq
the file looks like this:
#D3YGT8Q1:297:C7T4RACXX:3:1101:1000 #D3YGT8Q1:297:C7T4RACXX:3:1101:1000
TGATGTTTGGATGTAAAGTGAAATATTAGTTGGCG AGCTTTCCTCACTATCTGCTTCATCCGCCAACTAA
+ +
BBBFFFFFFFFFFFIFFIFFIIIIFIIIFIIFIII B0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:1000 #D3YGT8Q1:297:C7T4RACXX:3:1101:1000
CCTCCTAGGCGACCCAGACAATTATACCCTAGCCA TGTTTAAGGGGTTGGCTAGGGTATAATTGTCTGGG
+ +
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIII BBBFFFFFFFFFFIIIIIIIIBFFIIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:1000 #D3YGT8Q1:297:C7T4RACXX:3:1101:1000
TTCTATTTATTACCTCAGAAGTTTTTTTCTTCGCA GTAAAAGGCTCAGAAAAATCCTGCGAAGAAAAAAA
+ +
BBBFFFFFFFFFFIIIIIIIIFIIFIIIFIIIIII BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIII
And now I want to randomly select only one read from each pair. My initial idea was to use shuf to get a random number from range 1-2:
shuf -i1-2 -n1
and then somehow select the read cooresponding to the number I got from shuf. For example in the first iteration I got 1 so I pick the read from column 1, in the socond iteration I got 2 so from the next pair of reads I pick the read in the second column etc.
I got stuck here. So my question is, is there a neat way to do this? Maybe with awk or some other method? Any help will be very appreciated.
Comment to Ashafixs answer:
Thanks for your respond and sorry for the huge delay...!
I've tested your solutions and they both seem to have flaws.
For the first script I constructed test fastq files R1 and R2 each containing 6 reads. After running the script I expect it to output 6 reads as well (24 lines) in the correct order(ID,seq,desc,qual) but as a set of reads randomly selected from R1 or R2 file. What I got from the script is:
#D3YGT8Q1:297:C7T4RACXX:3:1101:10002:27381 2:N:0:ATGCTCGTTCTCTCGT
AGCTTTCCTCACTATCTGCTTCATCCGCCAACTAATATTTCACTTTACATCCAAACATCAAGATC
+
B0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFIFIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:10004:50631 2:N:0:ATGCTCGTTCTCTCGT
#D3YGT8Q1:297:C7T4RACXX:3:1101:10007:32152 1:N:0:ATGCTCGTTCTCTCGT
GTAAGGTTAGGAGGGTGTTAATTATTAAAATTAAGGCGAAGTTTATTACTCTTTTTTGAATGTTG
+
BBBFFFFFFFFFFIIBFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFF
You can see that the output is not correct. The second read is missing three lines and there should be six reads in total not three. In addition each time I run the script it outputs different number of reads.
For the second script I input a merged fastq file like described above. The output looks similar to the first script output:
#D3YGT8Q1:297:C7T4RACXX:3:1101:10002:27381 2:N:0:ATGCTCGTTCTCTCGT
AGCTTTCCTCACTATCTGCTTCATCCGCCAACTAATATTTCACTTTACATCCAAACATCAAGATC
+
B0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFIFIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:10004:50631 2:N:0:ATGCTCGTTCTCTCGT
#D3YGT8Q1:297:C7T4RACXX:3:1101:10004:50631 2:N:0:ATGCTCGTTCTCTCGT
TGTTTAAGGGGTTGGCTAGGGTATAATTGTCTGGGTCGCCTAGGAGGAGATCGGAAGAGCGTCGT
+
BBBFFFFFFFFFFIIIIIIIIBFFIIIIIIIIIIIFFFIIIIIIFIIIIIFIIIFFFFFFFFFFF
#D3YGT8Q1:297:C7T4RACXX:3:1101:10004:88140 1:N:0:ATGCTCGTTCTCTCGT
ACTGTAACTTAAAAATGATCAAATTATGTTTCCCATGCATCAGGTGCAATGAGAAGCTCTTCATC
+
BBBFFFFFFFFFFIIIIIIIIIIFIIIIIIFIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:10007:32152 2:N:0:ATGCTCGTTCTCTCGT
CTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATTAACACCCTC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIII
but this time I always get five reads. Still missing one. And the second and third read headers are the same. This should not happen.

You could try the following script (it also works as one liner). First it gets all the headers from your first fastq file, then it randomly picks a fastq file and returns 4 lines from it.
Please note: This only works if both files have identical headers at identical positions.
#!/bin/bash
headers=$(grep # R1_sample.fastq)
var=1
for line in $headers ; do
r=$(shuf -i1-2 -n1)
tail -n +$var "R$r"_sample.fastq | grep -m 1 -A 4 $line
var=$((var+4))
done
Alternatively you could expand your merge and pick a column approach. cut is used to remove a random column from the merged output.
#!/bin/bash
headers=$(grep # merged.fastq)
var=1
for line in $headers ; do
r=$(shuf -i1-2 -n1)
tail -n +$var merged.fastq | grep -m 1 -A 4 $line | cut -d$'\t' -f$r
var=$((var+4))
done

Join in unix when field is numeric in a huge file

So I have two files. File A and File B. File A is huge (>60 GB) and has 16 rows, a mix of numeric and strings, is separated by "|", and has over 600,000,000 lines. Field 3 in this file is the ID and it is a numeric field, with different lengths (e.g., someone's ID can be 1, and someone else's can be 100)
File B just has a bunch of ID (~1,000,000) and I want to extract all the rows from File A that have an ID that is in `File B'. I have started doing this using Linux with the following code
sort -k3,3 -t'|' FileA.txt > FileASorted.txt
sort -k1,1 -t'|' FileB.txt > FileBSorted.txt
join -1 3 -2 1 -t'|' FileASorted.txt FileBSorted.txt > merged.txt
The problem I have is that merged.txt is empty (when I know for a fact there are at least 10 matches)... I have googled this and it seems like the issue is that the join field (the ID) is numeric. Some people propose padding the field with zeros but 1) I'm not entirely sure how to do this, and 2) this seems very slow/time inefficient.
Any other ideas out there? or help on how to add the padding of 0s only to the relevant field.

I would first sort file b using the unique flag (-u)
sort -u file.b > sortedfile.b
Then loop through sortedfile.b and for each grep file.a. In zsh I would do a
foreach C (`cat sortedfile.b`)
grep $C file.a > /dev/null
if [ $? -eq 0 ]; then
echo $C >> res.txt
fi
end
Redirect output from grep to /dev/null and test whether there was a match ($? -eq 0) and append (>>) the result from that line to res.txt.
A single > will overwrite the file. I'm a bit rusty at zsh now so there might be a typo. You may be using bash which can have a slightly different foreach syntax.

How to read a file located in in several folders and subfolders in Bash Shell

There are several files named TESTFILE which located in directories ~/main1/sub1, ~/main1/sub2, ~/main1/sub3, ..., ~/main2/sub1,~/main2/sub2, ... ~/mainX/subY where mainX is the main folder and subY are the subfolders inside the main folder. The TESTFILE file for each main folder-subfolder has the same pattern, but the data in each is unique.
Now here's what I want to do:
I want to read a specific number in the TESTFILE for each ~/mainX/subY.
I want to create a text file where every line has the following format [mainX][space][subY][space][value read from TESTFILE]
Some information about TESTFILE and the data I want to get:
It is an OSZICAR file from VASP, a DFT program
The number of lines in OSZICAR varies in different folder-subfolder combination
The information I want to get is always located in the last two lines of the file
The last two lines always look like this:
DAV: 2 -0.942521930239E+01 0.27889E-09 -0.79991E-13 864 0.312E-06
10 F= -.94252193E+01 E0= -.94252193E+01 d E =-.717252E-07
Or in general, the last two lines pattern is:
DAV: a b c d e f
g F= h E0= i d E = j
where the italicized parts are the parts that do not change and boldfaced variable are the ones that I want to get
Some information about main folder mainX and sub-folder subY:
The folders mainX and subY are all real numbers.
How I want the output to be:
Suppose mainX={0.12, 0.20, 0.34, 0.7} and subY={1.10, 2.30, 4.50, 1.00, 2.78}, and the last two lines of ~/0.12/1.10/OSZICAR is the example above, my output file should contain:
0.12 1.10 2 10 -.94252193E+01 -.94252193E+01 -.717252E-07
...
0.7 2.30 2 10 -.94252193E+01 -.94252193E+01 -.717252E-07
...
mainX mainY a g h i j
How do I do this in the simplest way possible? I'm reading grep, awk, sed and I'm very overwhelmed.

You could do this using some for loops in bash:
for m in ~/main*/; do
main=$(basename "$m")
for s in "$m"sub*/; do
sub=$(basename "$s")
num=$(tail -n2 TESTFILE | awk -F'[ =]+' 'NR==1{s=$2;next}{print s,$1,$3,$5,$8}')
echo "$main $sub $num"
done
done > output_file
I have modified the command to extract the data from your file. It uses tail to read the last two lines of the file. The lines are passed to awk, where they are split into fields using any number of spaces and = signs together as the field separator. The second field from the first of the two lines is saved to the variable s. next skips to the next line, then the columns that you are interested in are printed.

Your question is not very clear - specifically on how to extract the value from TESTFILE, but this is something like what you want:
#!/bin/bash
for X in {1..100}; do
for Y in {1..100}; do
directory="main${X}/sub${Y}"
echo Checking $directory
if [ -f "${directory}/TESTFILE" ]; then
something=$(grep something "${directory}/TESTFILE")
echo main${X} sub${Y} $something
fi
done
done

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio