Choosing random lines with replacement from an XY file - bash

I have an XY file with over 40000 unique lines of floating-point numbers. I want to apply bootstrap resampling to this file. Bootstrap resampling works as follows: it resamples N random lines (where N is the number of lines in the input file) with replacement from the input file. This means the new data set (the output) has the same number of lines as the original file, and it can contain some lines multiple times while not containing some of the original lines at all. I tried shuffling lines using
shuf -n N input > output
and
sort -R input | head -n N > output
but it seems neither of them implements the replacement.
I would appreciate it if somebody could show a way to do this using awk and shell.

I believe what you are after is the following:
Assume you have an input file input with the following content:
$ seq 10 > input
Then you can get a new randomised file with the same lines and possible repetitions as following:
$ shuf -rn $(wc -l < input) input
7
2
9
3
1
7
4
8
7
10
Here we use the -r flag to allow for repetitions. Note the redirection in $(wc -l < input): without it, wc would also print the file name and shuf would receive an extra operand.
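Since the question also asks for an awk option, the same resampling can be sketched in pure awk (this is my own sketch, not part of the original answer; awk's rand() is only pseudo-random, and srand() seeds from the clock):

```shell
# Generate sample data, then bootstrap-resample it with awk alone.
seq 10 > input
awk 'BEGIN { srand() }                         # seed the RNG
     { line[NR] = $0 }                         # keep every input line in memory
     END { for (i = 1; i <= NR; i++)           # emit exactly NR lines...
             print line[int(rand() * NR) + 1]  # ...each drawn uniformly, with replacement
     }' input > output
```

Each run produces a different resample; for files far larger than memory a different approach would be needed.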


Explanation of an assignment

I am NOT looking for an answer to this problem. I am having trouble understanding what I should be trying to accomplish in this assignment. I welcome pseudocode or hints if you like, but what I really need is an explanation of what I should be making and what the output should be/look like. Please do not write out a lot of code, though; I would like to try that on my own.
(()) = notes from me
The assignment is:
a program (prog.exe) ((we are given this program)) that reads 2 integers (m, n) and 1 double (a) from an input data file named input.in. For example, the given sample input.in file contains the values
5 7 1.23456789012345
when you run ./prog.exe, the output is a long column of floating-point numbers
in addition to the program, there is a file called ain.in that contains a long column of double-precision values.
copy prog.exe and ain.in to your working directory
Write a bash script that does the following:
-Runs ./prog.exe for all combinations of
--m=0,1,...,10
--n=0,1,...,5
--a=every value in the file ain.in
-this is essentially a triple-nested loop over m, n and the ain.in values
-for each combination of m, n and ain.in value above:
-- generate the appropriate input file input.in
-- run the program and redirect the output to some temporary output file.
--extract the 37th and 51st values from this temporary output file and store these in a file called average.in
-when the 3 nested loops terminate, the average.in file should contain a long list of floating-point values.
-your script should return the average of the values contained in average.in
HINTS: seq, awk, and output redirection will be useful here
thank you to whoever took the time to even read through this.
This is my second bash coding assignment and I'm still trying to get a grasp on it, so a better explanation would be very helpful. Thanks again!
This is one way of generating all input combinations without explicit loops (-j9 joins on a nonexistent field, which is empty for every line, so join produces the full cross product):
join -j9 <(join -j9 <(seq 0 10) <(seq 0 5)) ain.in | cut -d' ' -f2-
The idea is to write a bash script that will test prog.exe under a variety of input conditions. This means recreating input.in and running prog.exe many times. Each time you run prog.exe, input.in should contain a different set of three numbers, e.g.,
First run:
0 0 <first line of ain.in>
Second run:
0 0 <second line of ain.in>
. . . last run:
10 5 <last line of ain.in>
You can use seq and for loops to accomplish this.
Then, you need to systematically save the output of each run, e.g.,
./prog.exe > tmp.out
# extract lines 37 and 51 and append to average.in
sed -n '37p; 51p; 51q' tmp.out >> average.in
Finally, after testing all the combinations, use awk to compute the average of all the lines in average.in.
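Putting the pieces together, the whole flow can be sketched as one function (named run_all here purely for illustration; it is untested against the real prog.exe and assumes ./prog.exe and ain.in are present in the working directory, as the assignment states):

```shell
# Triple-nested loop over m, n, and the values in ain.in; for each combination,
# rebuild input.in, run the program, and harvest output lines 37 and 51.
run_all() {
    : > average.in                                    # start with an empty results file
    for m in $(seq 0 10); do
        for n in $(seq 0 5); do
            while read -r a; do
                echo "$m $n $a" > input.in            # regenerate the input file
                ./prog.exe > tmp.out                  # run the program on it
                sed -n '37p; 51p; 51q' tmp.out >> average.in  # keep values 37 and 51
            done < ain.in
        done
    done
    awk '{ sum += $1 } END { print sum / NR }' average.in     # print the final average
}
```

Calling run_all then leaves the collected values in average.in and prints their average.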
One-liner inspired by @karakfa:
join -j9 <(join -j9 <(seq 0 10) <(seq 0 5)) ain.in | cut -d' ' -f2- |
sed "s/.*/echo & >input.in;./prog.exe>tmp.out; sed -n '37p;51p;51q' tmp.out/" |
sh | awk '{sum+=$1; n++} END {print sum/n}'

Reverse lines in a file two by two

I'm trying to reverse the lines in a file, but I want to do it two lines by two lines.
For the following input:
1
2
3
4
…
97
98
I would like the following output:
97
98
…
3
4
1
2
I found lots of ways to reverse a file line by line (especially on this topic: How can I reverse the order of lines in a file?).
tac. The simplest. Doesn't seem to have an option for what I want, even if I tried to play around with options -r and -s.
tail -r. Not POSIX compliant, and my version doesn't seem to have an option to do that.
That leaves three sed formulas, and I think a little modification would do the trick. But I don't even understand what they're doing, so I'm stuck here.
sed '1!G;h;$!d'
sed -n '1!G;h;$p'
sed 'x;1!H;$!d;x'
Any help would be appreciated. I'll try to understand these formulas and to answer this question by myself.
Okay, I'll bite. In pure sed, we'll have to build the complete output in the hold buffer before printing it (because we see the stuff we want to print first last). A basic template can look like this:
sed 'N;G;h;$!d' filename # Incomplete!
That is:
N # fetch another line, append it to the one we already have in the pattern
# space
G # append the hold buffer to the pattern space.
h # save the result of that to the hold buffer
$!d # and unless the end of the input was reached, start over with the next
# line.
The hold buffer always contains the reversed version of the input processed so far, and the code takes two lines and glues them to the top of that. In the end, it is printed.
This has two problems:
If the number of input lines is odd, it prints only the last line of the file, and
we get a superfluous empty line at the end of the output.
The first is because N bails out if no more lines exist in the input, which happens with an odd number of input lines; we can solve the problem by executing it conditionally, only when the end of the input has not yet been reached. Just like the $!d above, this is done with $!N, where $ is the end-of-input condition and ! inverts it.
The second is because at the very beginning, the hold buffer contains an empty line that G appends to the pattern space when the code runs for the very first time. Since with $!N we don't know whether the line counter is 1 or 2 at that point, we should inhibit G conditionally on both. This can be done with 1,2!G, where 1,2 is a range spanning from line 1 to line 2, so that 1,2!G will run G only if the line counter is not between 1 and 2.
The whole script then becomes
sed '$!N;1,2!G;h;$!d' filename
Another approach is to combine sed with tac, such as
tac filename | sed -r 'N; s/(.*)\n(.*)/\2\n\1/' # requires GNU sed
That is not the shortest possible way to use sed here (you could also use tac filename | sed -n 'h;$!{n;G;};p'), but perhaps easier to understand: Every time a new line is processed, N fetches another line, and the s command swaps them. Because tac feeds us the lines in reverse, this restores pairs of lines to their original order.
The key difference to the first approach is the behavior for an odd number of lines: with the second approach, the first line of the file will be alone without a partner, whereas with the first it'll be the last.
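A three-line file makes the difference visible (assuming GNU sed and tac):

```shell
printf '1\n2\n3\n' > odd.txt

# First approach: the unpaired line is the *last* line of the file,
# and it comes out on top.
sed '$!N;1,2!G;h;$!d' odd.txt
# 3
# 1
# 2

# Second approach: the unpaired line is the *first* line of the file,
# and it comes out at the bottom.
tac odd.txt | sed -r 'N; s/(.*)\n(.*)/\2\n\1/'
# 2
# 3
# 1
```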
I would go with this:
tac file | while read -r a && read -r b; do echo "$b"; echo "$a"; done
Here is an awk you can use:
cat file
1
2
3
4
5
6
7
8
awk '{a[NR]=$0} END {for (i=NR;i>=1;i-=2) print a[i-1]"\n"a[i]}' file
7
8
5
6
3
4
1
2
It stores all lines in an array a, then prints them out in reverse, two by two. (Note: with an odd number of input lines, the final iteration prints the empty a[0] before a[1], yielding a stray blank line.)

Unix: How to sort a dat file and keep original line numbers

I have a large data file containing over a thousand entries. I would like to sort them but maintain the original line numbers. For instance,
1:100
2:120
3:10
4:59
Where the first number is the line number, not saved in the data file, separated by a colon from the real number. I would like to sort it and keep the line numbers bound to their original lines, with an output of:
2:120
1:100
4:59
3:10
If possible, I would like to do this without creating another file, and numbering them by hand is not an option for the data size I'm using.
Given a file test.dat:
100
120
10
59
... the command:
$ cat -n test.dat | sort --key=2 -nr
2 120
1 100
4 59
3 10
... gives the output that you seem to be looking for (though with the line numbers padded with spaces and the fields delimited by tabs, which is easily changed if necessary):
$ cat -n test.dat | sort --key=2 -nr | sed -e's/\t/:/'
2:120
1:100
4:59
3:10
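If the tab and the leading padding from cat -n get in the way, an alternative sketch is to do the numbering in awk and sort on the second colon-separated field:

```shell
# Same sample data as above.
printf '100\n120\n10\n59\n' > test.dat

# Prefix each line with its line number, then sort numerically (descending)
# on the value after the colon.
awk '{ print NR ":" $0 }' test.dat | sort -t: -k2 -nr
# 2:120
# 1:100
# 4:59
# 3:10
```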

How to split a large file into many small files using bash? [duplicate]

This question already has answers here:
How can I split a large text file into smaller files with an equal number of lines?
(12 answers)
Closed 6 years ago.
I have a file, say all, with 2000 lines, and I hope it can be split into 4 small files with line number 1~500, 501~1000, 1001~1500, 1501~2000.
Perhaps, I can do this using:
cat all | head -500 >small1
cat all | tail -1500 | head -500 >small2
cat all | tail -1000 | head -500 >small3
cat all | tail -500 >small4
But this way involves calculating the line numbers, which may cause errors when the number of lines doesn't divide evenly, or when we want to split the file into many small files (e.g.: file all with 3241 lines that we want to split into 7 files, each with 463 lines).
Is there a better way to do this?
When you want to split a file, use split:
split -l 500 all all
will split the file into several files that each have 500 lines. If you want to split the file into 4 files of roughly the same size, use something like:
split -l $(( $( wc -l < all ) / 4 + 1 )) all all
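As a quick check of that second form on the question's own example (3241 lines into 7 files; the part. prefix is just a name chosen here):

```shell
seq 3241 > all
# 3241 / 7 = 463, so each piece gets 464 lines and the last piece the remainder.
split -l $(( $(wc -l < all) / 7 + 1 )) all part.
wc -l part.*
# six files of 464 lines plus a seventh with the remaining 457
```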
Look into the split command, it should do what you want (and more):
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names.
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes[=FROM] use numeric suffixes instead of alphabetic.
FROM changes the start value (default 0).
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines per output file
-n, --number=CHUNKS generate CHUNKS output files. See below
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
SIZE is an integer and optional unit (example: 10M is 10*1024*1024). Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (powers of 1000).
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines
l/K/N output Kth of N to stdout without splitting lines
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
Like the others have already mentioned, you could use split. The complicated command substitution mentioned in the accepted answer is not necessary. For reference, I'm adding the following commands, which accomplish almost what has been requested. Note that when using the -n command-line argument to specify the number of chunks, the small* files do not contain exactly 500 lines.
$ seq 2000 > all
$ split -n l/4 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
583 small1
528 small2
445 small3
444 small4
2000 total
Alternatively, you could use GNU parallel:
$ < all parallel -N500 --pipe --cat cp {} small{#}
$ wc -l small*
500 small1
500 small2
500 small3
500 small4
2000 total
As you can see, this incantation is quite complex. GNU Parallel is actually most-often used for parallelizing pipelines. IMHO a tool worth looking into.

How can I split a large text file into smaller files with an equal number of lines?

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).
I could do this fairly easily in Python, but I'm wondering if there's any kind of ninja way to do this using Bash and Unix utilities (as opposed to manually looping and counting / partitioning lines).
Have a look at the split command:
$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic to standard error just
before each output file is opened
--help display this help and exit
--version output version information and exit
You could do something like this:
split -l 200000 filename
which will create files each with 200000 lines named xaa xab xac ...
Another option, split by size of output file (still splits on line breaks):
split -C 20m --numeric-suffixes input_filename output_prefix
creates files like output_prefix00 output_prefix01 output_prefix02 ... each of maximum size 20 megabytes.
Use the split command:
split -l 200000 mybigfile.txt
Yes, there is a split command. It will split a file by lines or bytes.
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
Split the file "file.txt" into 10,000-lines files:
split -l 10000 file.txt
To split a large text file into smaller files of 1000 lines each:
split <file> -l 1000
To split a large binary file into smaller files of 10M each:
split <file> -b 10M
To consolidate split files into a single file:
cat x* > <file>
Split a file, each split having 10 lines (except the last split):
split -l 10 filename
Split a file into 5 files. File is split such that each split has same size (except the last split):
split -n 5 filename
Split a file with 512 bytes in each split (except the last split; use 512k for kilobytes and 512m for megabytes):
split -b 512 filename
Split a file with at most 512 bytes in each split without breaking lines:
split -C 512 filename
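A small sanity check of the line-based form (a sketch; x is split's default output prefix):

```shell
seq 25 > filename
split -l 10 filename
wc -l xa?
# xaa and xab get 10 lines each; xac gets the remaining 5
```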
Use split:
Split a file into fixed-size pieces, creates output files containing consecutive sections of INPUT (standard input if none is given or INPUT is `-')
Syntax split [options] [INPUT [PREFIX]]
Use:
sed -n '1,100p' filename > output.txt
Here, 1 and 100 are the line numbers that will be captured in output.txt.
split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:
-n, --number=CHUNKS generate CHUNKS output files; see explanation below
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines/records
l/K/N output Kth of N to stdout without splitting lines/records
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
Thus, split -n 4 input output. will generate four files (output.aa, output.ab, output.ac, output.ad) with the same number of bytes, but lines might be broken in the middle.
If we want to preserve full lines (i.e. split by lines), then this should work:
split -n l/4 input output.
Related answer: https://stackoverflow.com/a/19031247
You can also use awk; parenthesizing the redirection target keeps it portable across awk implementations, and testing NR%200000==1 starts a new file exactly every 200000 lines:
awk 'NR%200000==1 {c++} {print > (c ".txt")}' largefile
In case you just want to split by x number of lines each file, the given answers about split are OK. But, I am curious about why no one paid attention to the requirements:
"without having to count them" -> using wc + cut
"having the remainder in extra file" -> split does by default
I can't do that without counting, but wc does the counting for me (the wc | cut -f3 idiom silently depends on wc's column padding, so it's safer to read the count directly):
split -l $(( $(wc -l < "$filename") / $chunks )) "$filename"
This can easily be added to your .bashrc as a function, so you can just invoke it, passing the filename and the number of chunks:
split -l $(( $(wc -l < "$1") / $2 )) "$1"
In case you want just x chunks without a remainder in an extra file, adapt the formula to round up (add chunks - 1 before dividing). I use this approach because usually I just want x number of files rather than x lines per file:
split -l $(( ( $(wc -l < "$1") + $2 - 1 ) / $2 )) "$1"
You can add that to a script and call it your "ninja way", because if nothing suits your needs, you can build it :-)
HDFS getmerge small file and split into a proper size.
This method would split in the middle of lines (split -b does not respect line boundaries):
split -b 125m compact.file -d -a 3 compact_prefix
Instead, I use getmerge and then split into files of about 128 MB each without breaking lines.
# Split into 128 MB, and judge sizeunit is M or G. Please test before use.
begainsize=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $1}' `
sizeunit=`hdfs dfs -du -s -h /externaldata/$table_name/$date/ | awk '{ print $2}' `
if [ $sizeunit = "G" ];then
res=$(printf "%.f" `echo "scale=5;$begainsize*8 "|bc`)
else
res=$(printf "%.f" `echo "scale=5;$begainsize/128 "|bc`) # Ceiling ref http://blog.csdn.net/naiveloafer/article/details/8783518
fi
echo $res
# Split into $res files with a number suffix. Ref: http://blog.csdn.net/microzone/article/details/52839598
compact_file_name=$compact_file"_"
echo "compact_file_name: "$compact_file_name
split -n l/$res $basedir/$compact_file -d -a 3 $basedir/${compact_file_name}
Here is an example dividing the file toSplit.txt into smaller files of 200 lines named splited00.txt, splited01.txt, ..., splited25.txt, ...
split -l 200 --numeric-suffixes --additional-suffix=".txt" toSplit.txt splited
