bash script problem with understanding how shuf works

I have a problem understanding this line of code:
for NUMBER in $(shuf -i1-$MAX_NUMBER)
Do I understand correctly that NUMBER takes consecutive values up to "$MAX_NUMBER", or does "shuf -i1-" change something?

shuf -i1-$MAX_NUMBER prints a random permutation of the numbers in the range 1 to $MAX_NUMBER (i.e., not in consecutive order).
This means that in each iteration of the loop, the value of $NUMBER will be a random value between 1 and $MAX_NUMBER, until all numbers have been used.
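For illustration, here is a minimal sketch (MAX_NUMBER=5 is a made-up value, just for the demo):
MAX_NUMBER=5   # hypothetical value for the demo
for NUMBER in $(shuf -i1-$MAX_NUMBER); do
    echo "$NUMBER"
done
# prints 1 2 3 4 5 in some random order, e.g. 3 5 1 4 2, each number exactly once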

Related

Shouldn't the same srand value produce the same random numbers?

When I repeatedly run this code,
srand 1;
my @x = (1..1000).pick: 100;
say sum @x;
I get different answers each time. If I'm resetting with srand why shouldn't it produce the same random numbers each time?
The issue occurs in the REPL. It also occurs with this file:
use v6.d;
srand 1;
my $x = rand;
say $x; # OUTPUT: 0.5511548437617427
srand 1;
$x = rand;
say $x; # OUTPUT: 0.308302962221659
say $*KERNEL; # OUTPUT: darwin
I'm using:
Welcome to Rakudo™ v2022.07.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2022.07.
It should produce the same numbers for a given piece of code all of the time. And I haven't been able to reproduce your observation in any way.
There may be something spooky going on under the hood, though:
$ raku -e 'srand 1; (my $x = (1..1000).pick(1)).say'
(761)
$ raku -e 'srand 1; (my @x = (1..1000).pick(1)).say'
[471]
On the surface, you'd say that these values should be the same, as each only generates a single value. But apparently a different number of random values is actually calculated under the hood, causing the visibly different values. Is that perhaps what is going on in your case?
(This answer is a paraphrase of jnthn's comment in the GitHub issue opened based on this question).
Setting srand 1 will cause the same sequence of random numbers to be generated -- that is, the nth random number will be the same. However, since Raku (really, Rakudo and/or MoarVM, assuming you're using those backends) uses random numbers internally, you won't always be in the same position in that sequence (i.e., your n might be different) and thus you might not get the same random number.
This is further complicated by Rakudo's optimizer. Naively, repeating the same code later in the program should consume the same number of random numbers from the sequence. However, the optimizer may well remove some of those random number uses from subsequent calls, which can result in different random numbers.
I'm unclear to what degree the current behavior is intended versus a bug in Rakudo/MoarVM's implementation; please see the previously linked issue for additional details.

bash - Making repetitions in a sequence explicit: how to make AACCCC into 2A4C?

I am looking for a way to quantify the repetitiveness of a DNA sequence. My question is: how are the tandem repeats of a single nucleotide distributed within a given DNA sequence?
To answer that I would need a simple way to "compress" a sequence where there are identical letters repeated several times.
For instance:
AAAATTCGCATTTTTTAGGTA --> 4A2T1C1G1C1A6T1A2G1T1A
From this I would be able to extract the numbers to study the distribution of the repetitions (probably a Poisson distribution I would say), like:
4A2T1C1G1C1A6T1A2G1T1A --> 4 2 1 1 1 1 6 1 2 1 1
The limiting step for me is the first one. There are some topics which give an answer to my question but I am looking for a bash solution using regular expressions.
how to match dna sequence pattern (solution in C++)
Analyze tandem repeat motifs in DNA sequences (solution in python)
Sequence Compression? (solution in Javascript)
So if my question inspires some regex kings, it would help me a lot.
If there is a software that does this I would take it for sure as well!
Thanks all, I hope I was clear enough
Egill
As others mentioned, Bash might not be ideal for data crunching. That being said, the compression part is not that difficult to implement:
#!/usr/bin/env bash
# Compress DNA sequence [$1: sequence string, $2: name of output variable]
function compress_sequence() {
    local input="$1"
    local -n output="$2"; output=""
    local curr_char="" last_char="${input:0:1}" char_count=1 i
    for ((i=1; i <= ${#input}; i++)); do
        curr_char="${input:i:1}"
        if [[ "${curr_char}" != "${last_char}" ]]; then
            output+="${char_count}${last_char}"
            last_char="${curr_char}"
            char_count=1
        else
            char_count=$((char_count + 1))
        fi
    done
}
compress_sequence "AAAATTCGCATTTTTTAGGTA" compressed
echo "${compressed}"
This algorithm processes the sequence string character by character, counts identical characters and adds <count><char> to the output whenever characters change. I did not use regular expressions here and I'm pretty sure there wouldn't be any benefits in doing so.
I might as well add the number extracting part as it is trivial:
numbers_string="${compressed//[^0-9]/ }"
numbers_array=(${numbers_string})
This replaces everything that is not a digit with a space. The array is just a suggestion for further processing.
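As a possible next step for studying the distribution (this part is my own suggestion, not from the question): count how often each run length occurs. For the example sequence AAAATTCGCATTTTTTAGGTA this gives seven runs of length 1, two of length 2, one of length 4 and one of length 6:
printf '%s\n' "${numbers_array[@]}" | sort -n | uniq -c
#      7 1
#      2 2
#      1 4
#      1 6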

loop to check if the multiples of a user defined number are even within a user defined range

Hey everyone, I am trying to write a bash script that takes a user-defined number, runs through its multiples, checks which of those multiples are even, and prints only those, for a user-defined count. The script seems to work when an even number is given as the value, but when an odd number is given it only prints half of the numbers you wish to see. I think I know why: it is the while condition with $mult -le $range, but I am not sure how to fix it so that it shows the full count for both even and odd input numbers. Any help is appreciated, thanks.
code
#!/bin/sh
echo "Please input a value"
read -r val
echo "Please input how many multiples of this number whose term is even you wish to see"
read -r range
mult=1
while [ $mult -le $range ]
do
    term=$(($val*$mult))
    if [[ $(($term % 2)) -eq 0 ]]
    then echo "$term"
    fi
    ((mult++))
done
echo "For the multiples of $val these are the $range whose terms were even"
Output for an even input value:
$ ./test3.sh
Please input a value
8
Please input how many multiples of this number whose term is even you wish to see
4
8
16
24
32
For the multiples of 8 these are the 4 whose terms were even
Output for an odd input value:
$ ./test3.sh
Please input a value
5
Please input how many multiples of this number whose term is even you wish to see
10
10
20
30
40
50
For the multiples of 5 these are the 10 whose terms were even
Your current while condition assumes that the number of even multiples of val that are less than or equal to val * range is at least range. In fact, for even values of val there are precisely range such even multiples. This is not the case for odd values - as you've encountered.
You'll need to introduce a new variable to solve this problem - one that keeps track of the number of even multiples you have encountered thus far. Once you reach the desired number, the loop should terminate.
So, you could set a counter initially
count=0
and check this in your while loop condition
while [ $count -lt $range ]
You would increment count each time you enter the body of the if - i.e. whenever you encounter an even multiple.
This should give you the desired behavior.
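Putting that together, a minimal sketch of the adjusted loop (keeping your variable names; only count is new) could look like this:
count=0
mult=1
while [ "$count" -lt "$range" ]
do
    term=$(( val * mult ))
    if [ $(( term % 2 )) -eq 0 ]
    then
        echo "$term"
        count=$(( count + 1 ))   # count only the even multiples we actually print
    fi
    mult=$(( mult + 1 ))
done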

Counting integer frequency through pipe

Description
I have a for loop in bash with 10^4 iterations in total. Each iteration a list of roughly 10^7 numbers is generated from a pipe, each number an integer between 1 and 10^8. I want to keep track of how many times each integer appeared. The ideal output would be a .txt file with 10^8 lines, each line containing a counter for the integer corresponding to the row number.
As a significant proportion of integers do not appear while others appear nearly every iteration, I imagined using a hashmap, so as to limit analysis to numbers that have appeared. However, I do not know how to fill it with numbers appearing sequentially from a pipe. Any help would be greatly appreciated!
Reproducible example:
sample.R
args = commandArgs(trailingOnly=TRUE)
n_samples = as.numeric(args[1])
n_max = as.numeric(args[2])
v = as.character(sample(1:n_max, n_samples))
writeLines(v)
for loop:
for i in {1..n_loops}
do
    Rscript sample.R n_samples n_max | "COLLECT AND INCREMENT HERE"
done
, where in my case n_loops=10^4, n_samples=10^7, n_max = 10^8.
Simple Approach
Before doing premature optimization, try the usual approach with sort | uniq -c first -- if that is fast enough, you have less work and a shorter script. To speed things up without too much hassle, give sort more memory with -S and use the simple byte-based C locale via LC_ALL=C.
for i in {1..10000}; do
    Rscript sample.R n_samples n_max
done | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c
The output will have lines of the form number_of_matches integer_from_the_output. Only integers which appeared at least once will be listed.
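To see that output format on a toy input (this tiny test is my own illustration and does not use sample.R):
printf '%s\n' 5 3 5 5 2 | LC_ALL=C sort -n | uniq -c
#      1 2
#      1 3
#      3 5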
To convert this format (inefficiently) into your preferred format with 10^8 lines, each containing the count for the integer corresponding to the line number, replace the ... | sort | uniq -c part with the following command:
... | cat - <(seq 100''000''000) | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c | awk '{$1--;$2=""}1'
The seq part appends every possible integer once so that each line number shows up in the output; the awk part subtracts that extra occurrence again and drops the integer itself, leaving only the count. This assumes that all the generated integers are between 1 and 10^8 inclusive. The result gets mangled if any other values appear more than once.
Hash Map
If you want to go with the hash map, the simplest implementation would probably be an awk script:
for i in {1..10000}; do
    Rscript sample.R n_samples n_max
done | awk '{a[$0]++} END {for (ln=1; ln<=100000000; ln++) print int(a[ln])}'
However, I'm unsure whether this is such a good idea. The hash map could allocate much more memory than the actual data requires and is probably slow for that many entries.
Also, your awk implementation has to support large numbers. 32-bit integers are not sufficient. If the entire output is just the same integer repeated over and over again, you can get up to ...
10^4 iterations * 10^7 occurrences per iteration = 10^(4+7) occurrences = 10^11 occurrences
... of that integer. To store the maximal count of 10^11 you need at least 37 bits, since log2(10^11) ≈ 36.5.
GNU awk 5 on a 64-bit system seems to handle numbers of that size.
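A toy run of the same awk idea, with the range shrunk from 10^8 down to 5 and made-up input, shows the one-line-per-integer output:
printf '%s\n' 2 2 5 1 2 | awk '{a[$0]++} END {for (ln=1; ln<=5; ln++) print int(a[ln])}'
# 1   <- the value 1 appeared once
# 3   <- the value 2 appeared three times
# 0
# 0
# 1   <- the value 5 appeared once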
Faster Approach
Counting occurrences in a data structure is a good idea. However, a hash map is overkill as you have "only" 10^8 possible values as output. Therefore, you can use an array with 10^8 entries of 64-bit counters. The array would use ...
64 bit * 10^8 = 8 Byte * 10^(2+6) = 800 MByte
... of memory. I think 800 MByte should be free even on old PCs and Laptops from 10 years ago.
To implement this approach, use a "normal" programming language of your choice. Bash is not the right tool for this job. You can use bash to pipe the output of the loop into your program. Alternatively, you can execute the for loop directly in your program.
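For the piping variant, the bash side would be no more than this sketch (count_integers is a hypothetical program of yours that implements the counter array; it is not shown here):
for i in {1..10000}; do
    Rscript sample.R n_samples n_max
done | ./count_integers > counts.txt   # counts.txt: one counter per line, 10^8 lines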

Unexpected arithmetic result with zero padded numbers

I have a problem in my script wherein I'm reading a file and each line has data which is a representation of an amount. The said field always has a length of 12 and it's always a whole number. So let's say I have an amount of 25,000, the data will look like this 000000025000.
Apparently, I have to get the total amount of these lines but the zero prefixes are disrupting the computation. If I add the above mentioned number to a zero value like this:
echo $(( 0 + 000000025000 ))
Instead of getting 25000, I get 10752. I was thinking of looping through 000000025000 and, once I finally reach a non-zero digit, taking the substring from that index onwards. However, I'm hoping there is a more elegant solution for this.
The number 000000025000 is interpreted as an octal number because it starts with 0 (25000 in base 8 is 10752 in decimal).
If you use bash as your shell, you can use the prefix 10# to force the number to be interpreted as decimal:
echo $(( 10#000000025000 ))
From the bash man pages:
Constants with a leading 0 are interpreted as octal numbers. A leading 0x or 0X denotes hexadecimal. Otherwise, numbers take the form [base#]n, where the optional base is a decimal number between 2 and 64 representing the arithmetic base, and n is a number in that base.
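Applied to the original use case, a sketch for summing a whole file of such amounts could look like this (amounts.txt is a hypothetical input file with one 12-digit amount per line):
total=0
while IFS= read -r amount; do
    total=$(( total + 10#$amount ))   # 10# forces base 10 despite the leading zeros
done < amounts.txt
echo "$total"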
Using Perl
$ echo "000000025000" | perl -ne ' { printf("%d\n",scalar($_)) } '
25000
