Dieharder test with binary file - random

I have a binary file containing 10^8 binary digits. I want to feed it to dieharder as 100 sequences of 1,000,000 bits each.
I ran the command
dieharder -g 201 -f pseudoseq202.bin -a
But all the tests fail. I think I need to specify the sequence length. How do I specify the sequence length?
For the command
dieharder -g 202 -f pseudoseq202.bin -a
it gives
file_input(): Error: Wrong number of fields: format is 'fieldname: value'
Please help me figure out how to run the dieharder tests on a binary file.
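For reference, the -g 202 error above is about file layout: that generator appears to expect an ASCII file with a 'fieldname: value' header, as in the formatted-input question further down. A minimal sketch of that layout (the field names mirror the example below; the values here are made up):
type: d
count: 1000000
numbit: 32
3129711816
85411969
...
A raw binary file such as pseudoseq202.bin is presumably what -g 201 (raw file input) is for, and needs no header.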

Related

Cannot sort VCF with bcftools due to invalid input

I am trying to compress & index a VCF file and am facing several issues.
When I use bgzip/tabix, it throws an error saying it cannot be indexed due to some unsorted values.
# code used to bgzip and tabix
bgzip -c fn.vcf > fn.vcf.gz
tabix -p vcf fn.vcf.gz
# below is the error returned
[E::hts_idx_push] Unsorted positions on sequence #1: 115352924 followed by 115352606
tbx_index_build failed: fn.vcf.gz
When I use bcftools sort to sort this VCF to tackle #1, it throws an error due to invalid entries.
# code used to sort
bcftools sort -O z --output-file fn.vcf.gz fn.vcf
# below is the error returned
Writing to /tmp/bcftools-sort.YSrhjT
[W::vcf_parse_format] Extreme FORMAT/AD value encountered and set to missing at chr12:115350908
[E::vcf_parse_format] Invalid character '\x0F' in 'GT' FORMAT field at chr12:115352482
Error encountered while parsing the input
Cleaning
I've tried sorting using Linux commands to get around #2. However, when I run the code below, the size of fout.vcf is almost half of fin.vcf, indicating something might be going wrong.
grep "^#" fin.vcf > fout.vcf
grep -v "^#" fin.vcf | sort -k1,1V -k2,2n >> fout.vcf
Please let me know if you have any advice regarding:
How I could sort/fix the problematic inputs in my VCF in a safe & feasible way. (The file is 340 GB, so I cannot simply open it in an editor.)
Why my Linux sort might be behaving in an odd way (i.e. returning a file much smaller than the original).
Any comments or suggestions are appreciated!
Try this
mkdir tmp                                   # 1. create a tmp folder in your working directory
tmp=./tmp                                   # 2. point the variable at it (or at /yourpath/ of your choice)
bcftools sort file.vcf -T $tmp -Oz -o file.vcf.gz
You can index the file after sorting:
bcftools index file.vcf.gz
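If you still want to understand why the plain grep | sort output came out smaller, a quick sanity check (assuming the fin.vcf / fout.vcf names from the question) is to compare header and record counts before digging further:
grep -c "^#" fin.vcf
grep -c "^#" fout.vcf
grep -vc "^#" fin.vcf
grep -vc "^#" fout.vcf
If the counts match, no lines were dropped and the size difference must come from the content of the lines themselves.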

How to test (pseudo)random sequence of bits with dieharder?

I'm planning to run the Dieharder tests on some number sequences. For now, my code generates just ones and zeroes; in general, it will generate numbers between a and b.
How should I format the input text for dieharder in order to test my sequence of ones and zeroes? Can I write it like this:
type: d
count: 100
numbit: 1
1
0
1
1
...
?
Edit: I figured out how to use the Dieharder suite on Linux and now I'm testing sequences of numbers with:
dieharder -f totest1M97.input -a
Where totest1M97.input is a simple text file with numbers as follows:
1
4
9
5
78
46
...
Still, I'm wondering whether the input size is enough to feed Dieharder. I've read in the documentation that you can test a generator, such as urandom on Linux, with:
cat /dev/urandom | dieharder -a -g 200
How many numbers (what input size) are enough for Dieharder to be happy?
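For a sense of scale, one way to build a large, correctly formatted input file is to pull 32-bit integers from /dev/urandom; a minimal sketch (the count of 1,000,000 values and the file name are assumptions, and the header fields mirror the ones shown above):
N=1000000
{
  echo "type: d"
  echo "count: $N"
  echo "numbit: 32"
  od -An -tu4 -N $((4 * N)) /dev/urandom | tr -s ' ' '\n' | sed '/^$/d'
} > totest.input
dieharder -g 202 -f totest.input -a
As far as I understand, dieharder consumes unsigned 32-bit integers internally, so full-range 32-bit values are the most natural thing to feed it; single bits or tiny ranges can make many tests fail regardless of how random the source is.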

How to split files in PL/SQL based on fixed size

I am using Oracle's UTL_FILE package to generate some files. The file names have a certain format like <name>_<date>_<time>_<sequence> where sequence starts from 000.
Now I want to split the files if the original file is greater than x MB. In that case (with 2 files, say) I would get:
<name>_<date>_<time>_001
<name>_<date>_<time>_002
where 001 is 10 MB (the maximum) and 002 is < 10 MB.
The only way I can see to do this is to count the bytes of every line written by the UTL_FILE.put command and then decide whether to keep writing or to split.
This seems like a very CPU-intensive process to me.
Is there a way to do this differently in PL/SQL?
I don't have enough reputation to comment, hence answering.
That's interesting and challenging. But why do you want to do this in PL/SQL only? You can easily achieve this task by writing a shell script.
Let's say the file name is File1 and the size is 5.6 GB. Then the file should be split into 3 files, named File1, File2, File3.
You can use du -BG <file> to get the size in GB.
size=$(du -BG your_file | cut -dG -f1)
then
[ $size -ge 3 ] && split -d -b2G your_file file
The output will be file00, file01, file02 (a 5.6 GB file split into 2 GB chunks gives three pieces).
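If you want the pieces to follow the <name>_<date>_<time>_<sequence> convention from the question, a sketch along the same lines (the base name and the 10 MB limit are assumptions, and GNU split is required for --numeric-suffixes):
f=name_20240101_120000            # hypothetical file produced by UTL_FILE
if [ "$(du -m "$f" | cut -f1)" -gt 10 ]; then
  split --numeric-suffixes=1 -a 3 -b 10M "$f" "${f}_"
fi
This leaves name_20240101_120000_001, name_20240101_120000_002, ..., each at most 10 MB; the original can then be removed.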

How to split a large file into many small files using bash? [duplicate]

This question already has answers here:
How can I split a large text file into smaller files with an equal number of lines?
I have a file, say all, with 2000 lines, and I'd like it split into 4 small files containing lines 1~500, 501~1000, 1001~1500, and 1501~2000.
Perhaps, I can do this using:
cat all | head -500 >small1
cat all | tail -1500 | head -500 >small2
cat all | tail -1000 | head -500 >small3
cat all | tail -500 >small4
But this way involves calculating line numbers, which is error-prone when the number of lines doesn't divide evenly, or when we want to split the file into many small files (e.g. a file all with 3241 lines that we want to split into 7 files of 463 lines each).
Is there a better way to do this?
When you want to split a file, use split:
split -l 500 all all
will split the file into several files that each have 500 lines. If you want to split the file into 4 files of roughly the same size, use something like:
split -l $(( $( wc -l < all ) / 4 + 1 )) all all
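With the 2000-line example from the question, the arithmetic works out to 501 lines per chunk, with the remainder in the last file; a quick check (the part. prefix is just to keep the pieces easy to list):
$ seq 2000 > all
$ split -l $(( $( wc -l < all ) / 4 + 1 )) all part.
$ wc -l part.*
  501 part.aa
  501 part.ab
  501 part.ac
  497 part.ad
 2000 total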
Look into the split command; it should do what you want (and more):
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names.
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes[=FROM] use numeric suffixes instead of alphabetic.
FROM changes the start value (default 0).
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines per output file
-n, --number=CHUNKS generate CHUNKS output files. See below
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
SIZE is an integer and optional unit (example: 10M is 10*1024*1024). Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (powers of 1000).
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines
l/K/N output Kth of N to stdout without splitting lines
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
Like the others have already mentioned, you could use split. The complicated command substitution that the accepted answer mentions is not necessary. For reference, I'm adding the following commands, which accomplish almost what has been requested. Note that when using the -n command-line argument to specify the number of chunks, the small* files do not contain exactly 500 lines.
$ seq 2000 > all
$ split -n l/4 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
583 small1
528 small2
445 small3
444 small4
2000 total
Alternatively, you could use GNU parallel:
$ < all parallel -N500 --pipe --cat cp {} small{#}
$ wc -l small*
500 small1
500 small2
500 small3
500 small4
2000 total
As you can see, this incantation is quite complex. GNU Parallel is most often used for parallelizing pipelines; IMHO it's a tool worth looking into.

How can I split a large text file into smaller files with an equal number of lines?

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).
I could do this fairly easily in Python, but I'm wondering if there's any kind of ninja way to do this using Bash and Unix utilities (as opposed to manually looping and counting / partitioning lines).
Have a look at the split command:
$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic to standard error just
before each output file is opened
--help display this help and exit
--version output version information and exit
You could do something like this:
split -l 200000 filename
which will create files each with 200000 lines named xaa xab xac ...
Another option, split by size of output file (still splits on line breaks):
split -C 20m --numeric-suffixes input_filename output_prefix
creates files like output_prefix00 output_prefix01 output_prefix02 ..., each of at most 20 megabytes.
Use the split command:
split -l 200000 mybigfile.txt
Yes, there is a split command. It will split a file by lines or bytes.
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
Split the file "file.txt" into 10,000-lines files:
split -l 10000 file.txt
To split a large text file into smaller files of 1000 lines each:
split <file> -l 1000
To split a large binary file into smaller files of 10M each:
split <file> -b 10M
To consolidate split files into a single file:
cat x* > <file>
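A quick way to verify the round trip (the data.txt name is just an example) is to compare the rejoined file with the original:
split -l 1000 data.txt
cat x* > rejoined.txt
cmp data.txt rejoined.txt && echo "round trip OK"
This works because cat expands x* in alphabetical order, which matches the order in which split assigns its suffixes.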
Split a file, each split having 10 lines (except the last split):
split -l 10 filename
Split a file into 5 files. The file is split such that each piece has the same size (except the last):
split -n 5 filename
Split a file with 512 bytes in each split (except the last split; use 512k for kilobytes and 512m for megabytes):
split -b 512 filename
Split a file with at most 512 bytes in each split without breaking lines:
split -C 512 filename
Use split:
It splits a file into fixed-size pieces, creating output files that contain consecutive sections of INPUT (standard input if none is given or INPUT is `-').
Syntax: split [OPTION]... [INPUT [PREFIX]]
Use:
sed -n '1,100p' filename > output.txt
Here, 1 and 100 are the line numbers which you will capture in output.txt.
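On its own that only extracts a single range; a sketch of looping it to produce all the chunks (the 100-line chunk size and the output names are assumptions):
lines=$(wc -l < filename)
i=0
while [ $(( i * 100 )) -lt "$lines" ]; do
  sed -n "$(( i * 100 + 1 )),$(( (i + 1) * 100 ))p" filename > "output_$i.txt"
  i=$(( i + 1 ))
done
Note this rereads the whole file once per chunk, so split is still the better tool for very large inputs.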
split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:
-n, --number=CHUNKS generate CHUNKS output files; see explanation below
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines/records
l/K/N output Kth of N to stdout without splitting lines/records
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
Thus, split -n 4 input output. will generate four files (output.a{a,b,c,d}) of roughly the same size in bytes, but lines might be broken in the middle.
If we want to preserve full lines (i.e. split by lines), then this should work:
split -n l/4 input output.
Related answer: https://stackoverflow.com/a/19031247
You can also use AWK, printing each line to a file named after a running chunk counter (the counter is bumped after every 200,000th line, so each chunk gets exactly 200,000 lines):
awk -v c=1 '{print > (c ".txt")} NR%200000==0{++c}' largefile
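A quick check of the chunking on a small example (run in an empty scratch directory; the chunk size of 3 is just for illustration):
$ seq 10 > largefile
$ awk -v c=1 '{print > (c ".txt")} NR%3==0{++c}' largefile
$ wc -l [0-9]*.txt
 3 1.txt
 3 2.txt
 3 3.txt
 1 4.txt
10 total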
If you just want to split into files of x lines each, the given answers about split are fine. But I am curious why no one paid attention to the requirements:
"without having to count them" -> using wc + cut
"having the remainder in extra file" -> split does by default
I can't do that without wc + cut, but here is what I'm using:
split -l $(expr `wc $filename | cut -d ' ' -f3` / $chunks) $filename
This can easily be added to your .bashrc as a function, so you can just invoke it, passing the filename and the number of chunks:
split -l $(expr `wc $1 | cut -d ' ' -f3` / $2) $1
If you want exactly x chunks with no remainder in an extra file, just adapt the formula to add (chunks - 1) to each file's line count. I use this approach because usually I just want x files rather than x lines per file:
split -l $(expr `wc $1 | cut -d ' ' -f3` / $2 + `expr $2 - 1`) $1
You can add that to a script and call it your "ninja way", because if nothing suits your needs, you can build it :-)
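As a concrete version of the .bashrc idea, a minimal sketch (the function name splitinto is made up, and it uses wc -l directly instead of wc + cut; the division is rounded up so the remainder is folded into the per-file line count):
splitinto () {
  # $1 = file to split, $2 = number of chunks
  split -l $(( ( $(wc -l < "$1") + $2 - 1 ) / $2 )) "$1"
}
Usage: splitinto bigfile 7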
HDFS: getmerge small files and split the result into a proper size.
This byte-based method breaks lines:
split -b 125m compact.file -d -a 3 compact_prefix
Instead, I try to getmerge and then split into files of about 128 MB each.
# Split into 128 MB chunks; check whether the size unit is M or G. Please test before use.
begainsize=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $1}' `
sizeunit=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $2}' `
if [ $sizeunit = "G" ];then
res=$(printf "%.f" `echo "scale=5;$begainsize*8 "|bc`)
else
res=$(printf "%.f" `echo "scale=5;$begainsize/128 "|bc`) # Ceiling, ref: http://blog.csdn.net/naiveloafer/article/details/8783518
fi
echo $res
# Split into $res files with a number suffix. Ref: http://blog.csdn.net/microzone/article/details/52839598
compact_file_name=$compact_file"_"
echo "compact_file_name: "$compact_file_name
split -n l/$res $basedir/$compact_file -d -a 3 $basedir/${compact_file_name}
Here is an example dividing the file "toSplit.txt" into smaller files of 200 lines each, named "splited00.txt", "splited01.txt", ..., "splited25.txt", and so on:
split -l 200 --numeric-suffixes --additional-suffix=".txt" toSplit.txt splited

Resources