Is there a way to split a large file into chunks with random sizes? - bash

I know you can split a file with split, but for test purposes I would like to split a large file into chunks whose sizes differ. Is this possible?
Alternatively, if the above-mentioned file is a zip, is there a way to split it into volumes of unequal sizes?
Any suggestions welcome! Thanks!

So the general question that you're asking is: how can I compute N random integers that sum to S? Specifically, S is the size of your file and N is how many smaller files that you want to break it into.
For example, assume that you want to split your file into 4 parts. If a, b, c, and d are four random numbers, then:
a + b + c + d = X
a/X + b/X + c/X + d/X = 1
S*a/X + S*b/X + S*c/X + S*d/X = S
Giving us four random numbers that sum to S, the size of your file.
Which means you'd want to write a script that:
Computes N random numbers (any random numbers).
Computes X as the sum of those random numbers.
Multiplies each of those random numbers by S/X (and makes sure you're left with integers greater than 0 that sum to S)
Splits the original file into pieces using the generated random numbers as sizes, using whatever tool you want.
This is a little much for a shell script, but would be pretty straightforward in something like Perl.
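For what it's worth, a rough shell sketch of those steps (not a polished implementation) could look like the following. It assumes bash with GNU coreutils (stat, dd); the file name and the number of pieces are placeholders:
#!/usr/bin/env bash
file="big.zip"   # placeholder: the file to split
n=4              # placeholder: how many pieces you want

size=$(stat -c%s "$file")    # S, the total size in bytes

# Steps 1 + 2: compute N random numbers and their sum X
rands=(); x=0
for ((i = 0; i < n; i++)); do
    r=$(( RANDOM + 1 ))      # any positive random number
    rands+=("$r")
    x=$(( x + r ))
done

# Steps 3 + 4: scale each number by S/X and cut that many bytes out of the file;
# the last piece takes the remainder so the sizes sum exactly to S
offset=0
for ((i = 0; i < n; i++)); do
    if (( i < n - 1 )); then
        len=$(( size * rands[i] / x ))
    else
        len=$(( size - offset ))
    fi
    # bs=1 keeps skip/count in bytes; slow but simple (GNU dd also offers
    # iflag=skip_bytes,count_bytes for a faster variant)
    dd if="$file" of="$file.part$(( i + 1 ))" bs=1 skip="$offset" count="$len" status=none
    offset=$(( offset + len ))
done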

Since you tagged the question only with shell, I suppose you want to handle it with a shell script and the common Linux commands/tools.
As far as I know there is no existing tool/command that can split a file randomly. To split a file, we can consider using split or dd.
Both tools support options for how big each split file should be or how many files you want. Let's say we use dd/split to first split your file into 500 parts, each with the same size. So we have:
foo.zip.001
foo.zip.002
foo.zip.003
...
foo.zip.500
Then we take this file list as input and merge pieces together (cat). This step can be done with awk or a shell script.
For example, we can build a set of cat statements like:
cat foo.zip.001 foo.zip.002 > part1
cat foo.zip.003 foo.zip.004 foo.zip.005 > part2
cat foo.zip.006 foo.zip.007 foo.zip.008 foo.zip.009 > part3
....
Run the generated cat statements and you get the final part1..n, each part with a different size.
For example:
kent$ seq -f'foo.zip.%g' 20|awk 'BEGIN{i=k=2}NR<i{s=s sprintf ("%s ",$0);next}{k++;i=(NR+k);print "cat "s$0" >part"k-2;s="" }'
cat foo.zip.1 foo.zip.2 >part1
cat foo.zip.3 foo.zip.4 foo.zip.5 >part2
cat foo.zip.6 foo.zip.7 foo.zip.8 foo.zip.9 >part3
cat foo.zip.10 foo.zip.11 foo.zip.12 foo.zip.13 foo.zip.14 >part4
cat foo.zip.15 foo.zip.16 foo.zip.17 foo.zip.18 foo.zip.19 foo.zip.20 >part5
How it performs you will have to test on your own... but at least this should work for your requirement.
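For completeness, here is a rough sketch of the two steps combined. It assumes GNU coreutils: split's -n/-d options are GNU extensions, and the numeric suffixes start at 000 rather than 001. The awk program is the generator shown above, with an extra END block so a trailing group of pieces is not dropped when the piece count does not land exactly on a group boundary:
# 1. Split foo.zip into 500 equal-size pieces: foo.zip.000 ... foo.zip.499
split -n 500 -d -a 3 foo.zip foo.zip.

# 2. Group the pieces into parts of growing, unequal sizes and concatenate them
printf '%s\n' foo.zip.[0-9]* \
  | awk 'BEGIN{i=k=2}
         NR<i {s=s sprintf("%s ",$0); next}
              {k++; i=NR+k; print "cat "s$0" >part"k-2; s=""}
         END  {if (s != "") print "cat "s">part"k-1}' \
  | sh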

Related

bash - Expliciting repetitions in a sequence : how to make AACCCC into 2A4C?

I am looking for a way to quantify the repetitiveness of a DNA sequence. My question is: how are the tandem repeats of a single nucleotide distributed within a given DNA sequence?
To answer that I would need a simple way to "compress" a sequence where there are identical letters repeated several times.
For instance:
AAAATTCGCATTTTTTAGGTA --> 4A2T1C1G1C1A6T1A2G1T1A
From this I would be able to extract the numbers to study the distribution of the repetitions (probably a Poisson distribution, I would say), like:
4A2T1C1G1C1A6T1A2G1T1A --> 4 2 1 1 1 1 6 1 2 1 1
The limiting step for me is the first one. There are some topics which give an answer to my question but I am looking for a bash solution using regular expressions.
how to match dna sequence pattern (solution in C++)
Analyze tandem repeat motifs in DNA sequences (solution in python)
Sequence Compression? (solution in Javascript)
So if my question inspires some regex kings, it would help me a lot.
If there is a software that does this I would take it for sure as well!
Thanks all, I hope I was clear enough
Egill
As others mentioned, Bash might not be ideal for data crunching. That being said, the compression part is not that difficult to implement:
#!/usr/bin/env bash
# Compress DNA sequence [$1: sequence string, $2: name of output variable]
function compress_sequence() {
    local input="$1"
    local -n output="$2"; output=""
    local curr_char="" last_char="${input:0:1}" char_count=1 i
    for ((i=1; i <= ${#input}; i++)); do
        curr_char="${input:i:1}"
        if [[ "${curr_char}" != "${last_char}" ]]; then
            output+="${char_count}${last_char}"
            last_char="${curr_char}"
            char_count=1
        else
            char_count=$((char_count + 1))
        fi
    done
}
compress_sequence "AAAATTCGCATTTTTTAGGTA" compressed
echo "${compressed}"
This algorithm processes the sequence string character by character, counts identical characters and adds <count><char> to the output whenever characters change. I did not use regular expressions here and I'm pretty sure there wouldn't be any benefits in doing so.
I might as well add the number extracting part as it is trivial:
numbers_string="${compressed//[^0-9]/ }"
numbers_array=(${numbers_string})
This replaces everything that is not a digit with a space. The array is just a suggestion for further processing.
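Since the question explicitly asks for a regex angle: here is a hedged alternative sketch that does the run splitting with a back-reference. It assumes GNU grep built with PCRE support (the -P flag) plus awk; the pattern (.)\1* matches a run of identical characters, -o prints each run on its own line, and awk turns every run into <count><char>:
echo "AAAATTCGCATTTTTTAGGTA" \
  | grep -oP '(.)\1*' \
  | awk '{printf "%d%s", length($0), substr($0, 1, 1)} END {print ""}'
# prints 4A2T1C1G1C1A6T1A2G1T1A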

Counting integer frequency through pipe

Description
I have a for loop in bash with 10^4 iterations in total. Each iteration a list of roughly 10^7 numbers is generated from a pipe, each number an integer between 1 and 10^8. I want to keep track of how many times each integer appeared. The ideal output would be a .txt file with 10^8 lines, each line containing a counter for the integer corresponding to the row number.
As a significant proportion of integers do not appear while others appear nearly every iteration, I imagined using a hashmap, so as to limit analysis to numbers that have appeared. However, I do not know how to fill it with numbers appearing sequentially from a pipe. Any help would be greatly appreciated!
Reproducible example:
sample.R
args = commandArgs(trailingOnly=TRUE)
n_samples = as.numeric(args[1])
n_max = as.numeric(args[2])
v = as.character(sample(1:n_max, n_samples))
writeLines(v)
for loop:
for i in {1..n_loops}
do
Rscript sample.R n_samples n_max | "COLLECT AND INCREMENT HERE"
done
where in my case n_loops = 10^4, n_samples = 10^7 and n_max = 10^8.
Simple Approach
Before doing any premature optimization, try the usual approach with sort | uniq -c first -- if that is fast enough, you have less work and a shorter script. To speed things up without too much hassle, give sort more memory with -S and use the simplest locale with LC_ALL=C.
for i in {1..10000}; do
    Rscript sample.R n_samples n_max
done | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c
The output will have lines of the form number_of_matches integer_from_the_output. Only integers which appeared at least once will be listed.
To convert this format (inefficiently) into your preferred format with 10^8 lines, each containing the count for the integer corresponding to the line number, replace the ... | sort | uniq -c part with the following command:
... | cat - <(seq 100''000''000) | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c | awk '{$1--;$2=""}1'
This assumes that all the generated integers are between 1 and 10^8 inclusive. The result gets mangled if any other values appear more than once.
Hash Map
If you want to go with the hash map, the simplest implementation would probably be an awk script:
for i in {1..10000}; do
    Rscript sample.R n_samples n_max
done | awk '{a[$0]++} END {for (ln=1; ln<=100000000; ln++) print int(a[ln])}'
However, I'm unsure whether this is such a good idea. The hash map could allocate much more memory than the actual data requires and is probably slow for that many entries.
Also, your awk implementation has to support large numbers. 32-bit integers are not sufficient. If the entire output is just the same integer repeated over and over again, you can get up to ...
10^4 iterations * 10^7 occurrences/iteration = 10^(4+7) occurrences = 10^11 occurrences
... of that integer. To store the maximal count of 10^11 you need at least 37 bits, since 37 > log2(10^11).
GNU awk 5 on a 64-bit system seems to handle numbers of that size.
Faster Approach
Counting occurrences in a data structure is a good idea. However, a hash map is overkill as you have "only" 10^8 possible values as output. Therefore, you can use an array with 10^8 entries of 64-bit counters. The array would use ...
64 bit * 10^8 = 8 Byte * 10^8 = 8 * 10^(2+6) Byte = 800 MByte
... of memory. I think 800 MByte should be free even on old PCs and Laptops from 10 years ago.
To implement this approach, use a "normal" programming language of your choice. Bash is not the right tool for this job. You can use bash to pipe the output of the loop into your program. Alternatively, you can execute the for loop directly in your program.

Solved: Grep and Dynamically Truncate at Same Time

Given the following:
for z in ...   # whatever condition changes $z
do
    aptitude show "$z" | grep -E 'Uncompressed Size: |x' | sed 's/Uncompressed Size: //'
done
That means 3 items are output to the screen ($z, Uncompressed Size, x).
I want all of that to fit on one line, and I define a line as 100 characters.
So ($z, Uncompressed Size, x) must fit on one line. But x is very long and will have to be truncated. So there is a requirement to add up the characters "used" by $z and Uncompressed Size, so that x can be truncated dynamically. I love scripting, and being able to do this I deem an absolute must. Needless to say, all 3 items being output to the screen change, hence the characters of the first two outputs must be calculated and subtracted from the characters allowed for x, and the sum of characters across all 3 items cannot exceed 100.
sed 's/.//5g'
Lmao, sometimes I wish I thought in simpler terms; complicated description + simple solution = a simple problem overcomplicated by the interpreter.
Thank you, Barmar
That only leaves giving sed (100 - the number of characters used by $z), and that count is available as ${#z}.
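Putting the two together, a minimal sketch could look like this (it assumes GNU sed, which accepts the NUMBER+g flag combination, and only budgets for $z as described above; the pipeline is the one from the question):
budget=$(( 100 - ${#z} ))               # characters left on the line after $z
aptitude show "$z" \
  | grep -E 'Uncompressed Size: |x' \
  | sed 's/Uncompressed Size: //' \
  | sed "s/.//$(( budget + 1 ))g"       # delete every character from position budget+1 onward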

Big data sort and search

I have two files of data, with 100-character lines each. File A: 10^8 lines, file B: 10^6 lines. And I need to find all the strings from file B that are not in file A.
At first I was thinking of feeding both files to mysql, but it looks like it won't ever finish creating a unique key on 10^8 records.
I'm waiting for your suggestions on this.
You can perform this operation without a database. The key is to reduce the size of A, since A is much larger than B. Here is how to do this:
Calculate 64-bit hashes using a decent hash function for the strings in the B file. Store these in memory (in a hash table), which you can do because B is small. Then hash all of the strings in your A file, line by line, and see if each one matches a hash for your B file. Any lines with matching hashes (to one from B), should be stored in a file C.
When this process is complete file C will have the small subset of A of potentially matching strings (to B). Now you have a much smaller file C that you need to compare lines of B with. This reduces the problem to a problem where you can actually load all of the lines from C into memory (as a hash table) and compare each line of B to see if it is in C.
You can slightly improve on #michael-goldshteyn's answer (https://stackoverflow.com/a/3926745/179529). Since you need to find all the strings in B that are not in A, you can remove an item from the hash table of B's elements whenever you find a match for it among the elements of A. The elements that remain in the hash table are the ones that were not found in file A.
For the sizes you mention you should be able to keep all of B in memory at once, so you could do a simplified version of Goldshteyn's answer; something like this in python:
#!/usr/bin/python3
import sys
if __name__=='__main__':
    b = open(sys.argv[2],'r')
    bs = set()
    for l in b:
        bs.add(l.strip())
    b.close()
    a = open(sys.argv[1],'r')
    for l in a:
        l = l.strip()
        if l in bs:
            bs.remove(l)
    for x in bs:
        print(x)
I've tested this on two files of 10^5 and 10^7 in size with ~8 chars per line on an atom processor. Output from /usr/bin/time:
25.15user 0.27system 0:25.80elapsed 98%CPU (0avgtext+0avgdata 56032maxresident)k
0inputs+0outputs (0major+3862minor)pagefaults 0swaps
60298 60298 509244

How to shuffle the lines in a file without reading the whole file in advance?

What's a good algorithm to shuffle the lines in a file without reading the whole file in advance?
I guess it would look something like this: start reading the file line by line from the beginning, storing the lines as you go, and at each point decide whether to print one of the lines stored so far (and then remove it from storage) or to do nothing and proceed to the next line.
Can someone verify / prove this and/or maybe post working (perl, python, etc.) code?
Related questions, but not looking at memory-efficient algorithms:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?
How can I randomize the lines in a file using standard tools on Red Hat Linux?
How can I print the lines in STDIN in random order in Perl?
I cannot think of a way to randomize the entire file without somehow maintaining a list of what has already been written. I think if I had to do a memory-efficient shuffle, I would scan the file, building a list of offsets for the newlines. Once I have this list of newline offsets, I would randomly pick one of them, write out the corresponding line, and then remove it from the list of offsets.
I am not familiar with Perl or Python, but I can demonstrate with PHP.
<?php
$offsets = array();
$f = fopen("file.txt", "r");
$offsets[] = ftell($f);
while (! feof($f))
{
    if (fgetc($f) == "\n") $offsets[] = ftell($f);
}
shuffle($offsets);
foreach ($offsets as $offset)
{
    fseek($f, $offset);
    echo fgets($f);
}
fclose($f);
?>
The only other option I can think of, if scanning the file for new lines is absolutely unacceptable, would be (I am not going to code this one out):
1. Determine the filesize
2. Create a list of offsets and lengths already written to stdout
3. Loop until bytes_written == filesize
4. Seek to a random offset that is not already in your list of already written values
5. Back up from that seek to the previous newline or start of file
6. Display that line, and add it to the list of offsets and lengths written
7. Go to 3.
The following algorithm is linear in the number of lines in your input file.
Preprocessing:
Find n (the total number of lines) by scanning for newlines (or whatever), and store the character positions marking the beginning and end of each line. So you'd have 2 vectors, say s and e, of size n, where the characters numbered s[i] to e[i] in your input file form the i-th line. In C++ I'd use vector.
Randomly permute a vector of the integers from 1 to n (in C++ it would be random_shuffle) and store it in a vector, say p (e.g. 1 2 3 4 becomes p = [3 1 4 2]). This means that line i of the new file is line p[i] of the original file (i.e. in the above example, the 1st line of the new file is the 3rd line of the original file).
Main
Create a new file
Write the first line in the new file by reading the text in the original file between s[p[0]] and e[p[0]] and appending it to the new file.
Continue as in step 2 for all the other lines.
So the overall complexity is linear in the number of lines (since random_shuffle is linear) if you assume read/write & seeking in a file (incrementing the file pointer) are all constant time operations.
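A rough shell sketch of that offsets-plus-permutation idea (assuming GNU coreutils for shuf/tail/head, single-byte characters, and a file small enough that passing every line offset to shuf -e is practical):
#!/usr/bin/env bash
file="input.txt"   # placeholder: the file to shuffle

# Preprocessing: byte offset of the start of each line (the "s" vector);
# the end of a line is found implicitly by reading up to the next newline.
mapfile -t starts < <(LC_ALL=C awk 'BEGIN{o=0}{print o; o += length($0) + 1}' "$file")

# Main: emit the lines in a randomly permuted order by seeking to each start.
for off in $(shuf -e "${starts[@]}"); do
    tail -c +"$(( off + 1 ))" "$file" | head -n 1
done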
You can create an array of N strings and read the first N lines of the file into this array. For the rest, you read one line, select at random one of the entries in the array, replace that entry with the newly read string, and write the string taken from the array to the output file. This has the advantage that you don't need to iterate over the file twice. The disadvantage is that it will not create a very random output file, especially when N is low (for example, this algorithm can't move the last line more than N lines up in the output.)
Edit
Just an example in python:
import sys
import random
CACHE_SIZE = 23
lines = {}
for l in sys.stdin: # you can replace sys.stdin with xrange(200) to get a test output
    i = random.randint(0, CACHE_SIZE-1)
    old = lines.get(i)
    if old:
        print old,
    lines[i] = l
for ignored, p in lines.iteritems():
    print p,
