I have a text file with a series of floating point numbers – one per line – like so:
1
0.98
1.21
0.68
0.647
0.1
More specifically: I generate these lines using an awk call.
How would I go about extracting the largest of these numbers in a single call? Bonus points for extracting the top n values.
Try this: sort -rn your_filename | head -n 1
Read about head: you can pass it the number of lines you want to display.
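For example, with the sample numbers above saved in a file (numbers.txt is just an illustrative name), reversing the sort order and taking the first n lines gives the top n values; since the lines come from an awk call anyway, the same sort | head can simply be appended to that pipeline:
$ sort -rn numbers.txt | head -n 3
1.21
1
0.98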
Does it solve your problem?
Related
I have an XY file with over 40000 unique lines of floating-point numbers. I want to use bootstrap resampling on this file. Bootstrap resampling works as follows: it resamples N random lines with replacement from the input file, where N is the number of lines in the input file. This means the new data set (the output) has the same number of lines as the first file, and it can contain some lines multiple times and might not contain some of the original lines at all. I tried shuffling lines using
shuf -n N input > output
and
sort -R input | head -n N > output
but it seems neither of them samples with replacement.
It would be deeply appreciated if somebody could show a way to do this using AWK and shell.
I believe what you are after is the following:
Assume you have an input file input with the following content:
$ seq 10 > input
Then you can get a new randomised file with the same number of lines and possible repetitions as follows:
$ shuf -rn "$(wc -l < input)" input
7
2
9
3
1
7
4
8
7
10
Here we use the -r flag to allow for repetitions.
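If you would rather stay within awk, as the question asks, here is a minimal sketch (assuming the input file is named input): it reads every line into memory and then prints NR lines drawn uniformly at random with replacement. Note that srand() without an argument seeds from the current time, so runs started within the same second produce identical samples.
awk 'BEGIN {srand()} {a[NR]=$0} END {for (i=1; i<=NR; i++) print a[int(rand()*NR)+1]}' input > output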
I am NOT looking for an answer to this problem. I am having trouble understanding what I should be trying to accomplish in this assignment. I welcome pseudocode or hints if you would like, but what I really need is an explanation of what I need to be making and what the output should look like. Please do not write out a lot of code, though; I would like to try that on my own.
(()) = notes from me
The assignment is:
a program (prog.exe) ((we are given this program)) that reads 2 integers (m, n) and 1 double (a) from an input data file named input.in. For example, the sample input.in file we are given contains the values
5 7 1.23456789012345
When you run ./prog.exe, the output is a long column of floating-point numbers.
In addition to the program, there is a file called ain.in that contains a long column of double-precision values.
Copy prog.exe and ain.in to your working directory.
Write a bash script that does the following:
-Runs ./prog.exe for all combinations of
--m=0,1,...,10
--n=0,1,...,5
--a=every value in the file ain.in
-this is essentially a triple nested loop over m,n and the ain.in values
-for each combination of m,n and ain.in value above:
-- generate the appropriate input file input.in
-- run the program and redirect the output to some temporary output file.
--extract the 37th and 51st values from this temporary output file and store these in a file called average.in
-when the 3 nested loops terminate the average.in file should contain a long list of floating point values.
-your script should return the average of the values contained in average.in
HINTS: seq, awk, and output redirection will be useful here
Thank you to whoever took the time to even read through this.
This is my second bash coding assignment and I'm still trying to get a grasp on it, so a better explanation would be very helpful. Thanks again!
This is one way of generating all input combinations without explicit loops:
join -j9 <(join -j9 <(seq 0 10) <(seq 0 5)) ain.in | cut -d' ' -f2-
Joining on field 9, which exists in neither input, gives every line an empty join key, so each line of one input pairs with every line of the other; that yields the full cross product, and cut then drops the empty leading field.
The idea is to write a bash script that will test prog.exe with a variety of input conditions. This means recreating input.in and running prog.exe many times. Each time you run prog.exe, input.in should contain a different set of three numbers, e.g.,
First run:
0 0 <first line of ain.in>
Second run:
0 0 <second line of ain.in>
. . . last run:
10 5 <last line of ain.in>
You can use seq and for loops to accomplish this.
Then, you need to systematically save the output of each run, e.g.,
./prog.exe > tmp.out
# extract lines 37 and 51 and append them to average.in
sed -n '37p; 51p; 51q' tmp.out >> average.in
Finally, after testing all the combinations, use awk to compute the average of all the lines in average.in.
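Put together, a skeleton of the script described above might look roughly like this (a sketch, not a complete solution; it assumes the file and program names given in the assignment):
for m in $(seq 0 10); do
  for n in $(seq 0 5); do
    while read -r a; do
      echo "$m $n $a" > input.in                     # regenerate the input file
      ./prog.exe > tmp.out                           # run the program, capture its output
      sed -n '37p; 51p; 51q' tmp.out >> average.in   # keep lines 37 and 51
    done < ain.in
  done
done
awk '{sum += $1} END {print sum / NR}' average.in    # average of everything collected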
One-liner inspired by @karakfa:
join -j9 <(join -j9 <(seq 0 10) <(seq 0 5)) ain.in | cut -d' ' -f2- |
sed "s/.*/echo & >input.in;./prog.exe>tmp.out; sed -n '37p;51p;51q' tmp.out/" |
sh | awk '{sum+=$1; n++} END {print sum/n}'
I'm trying to reverse the lines in a file, but I want to do it two lines by two lines.
For the following input:
1
2
3
4
…
97
98
I would like the following output:
97
98
…
3
4
1
2
I found lots of ways to reverse a file line by line (especially on this topic: How can I reverse the order of lines in a file?).
tac. The simplest. It doesn't seem to have an option for what I want, even though I tried to play around with the -r and -s options.
tail -r. Not POSIX compliant, and my version doesn't seem to offer anything that would do this.
That leaves three sed formulas, and I think a little modification would do the trick. But I don't even understand what they're doing, so I'm stuck here.
sed '1!G;h;$!d'
sed -n '1!G;h;$p'
sed 'x;1!H;$!d;x'
Any help would be appreciated. I'll try to understand these formulas and to answer this question by myself.
Okay, I'll bite. In pure sed, we'll have to build the complete output in the hold buffer before printing it (because the lines we want to print first are the ones we see last). A basic template can look like this:
sed 'N;G;h;$!d' filename # Incomplete!
That is:
N # fetch another line, append it to the one we already have in the pattern
# space
G # append the hold buffer to the pattern space.
h # save the result of that to the hold buffer
$!d # and unless the end of the input was reached, start over with the next
# line.
The hold buffer always contains the reversed version of the input processed so far, and the code takes two lines and glues them to the top of that. In the end, it is printed.
This has two problems:
If the number of input lines is odd, it prints only the last line of the file, and
we get a superfluous empty line at the end of the output.
The first is because N bails out if no more lines exist in the input, which happens with an odd number of input lines; we can solve the problem by executing it only when the end of the input has not yet been reached. Just like the $!d above, this is done with $!N, where $ is the end-of-input condition and ! inverts it.
The second is because at the very beginning, the hold buffer contains an empty line that G appends to the pattern space when the code is run for the very first time. Since with $!N we don't know whether the line counter is 1 or 2 at that point, we should inhibit G conditionally on both. This can be done with 1,2!G, where 1,2 is a range spanning from line 1 to line 2, so that 1,2!G will run G only if the line counter is not between 1 and 2.
The whole script then becomes
sed '$!N;1,2!G;h;$!d' filename
Another approach is to combine sed with tac, such as
tac filename | sed -r 'N; s/(.*)\n(.*)/\2\n\1/' # requires GNU sed
That is not the shortest possible way to use sed here (you could also use tac filename | sed -n 'h;$!{n;G;};p'), but perhaps easier to understand: Every time a new line is processed, N fetches another line, and the s command swaps them. Because tac feeds us the lines in reverse, this restores pairs of lines to their original order.
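For example, a quick check on a six-line sample (GNU coreutils and GNU sed assumed):
$ seq 6 | tac | sed -r 'N; s/(.*)\n(.*)/\2\n\1/'
5
6
3
4
1
2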
The key difference to the first approach is the behavior for an odd number of lines: with the second approach, the first line of the file will be alone without a partner, whereas with the first it'll be the last.
I would go with this:
tac file | while read -r a && read -r b; do echo "$b"; echo "$a"; done
Here is an awk you can use:
cat file
1
2
3
4
5
6
7
8
awk '{a[NR]=$0} END {for (i=NR;i>=1;i-=2) print a[i-1]"\n"a[i]}' file
7
8
5
6
3
4
1
2
It stores all the lines in an array a, then prints them out in reverse, two by two. Note that this assumes an even number of lines; see the variant below for the odd case.
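If the file can have an odd number of lines, the loop above pairs the lines off by one and prints an empty a[0] on its last iteration. A variant that prints the lone last line first and keeps the pairs intact might look like this (a sketch along the same lines):
awk '{a[NR]=$0} END {if (NR%2) print a[NR]; for (i=NR-NR%2; i>=2; i-=2) print a[i-1] "\n" a[i]}' file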
I have a large data file containing over a thousand entries. I would like to sort them but maintain the original line numbers. For instance,
1:100
2:120
3:10
4:59
Here the first number is the line number (it is not saved in the data file), separated by a colon from the real value. I would like to sort the file and keep the line numbers bound to their original lines, with an output of:
2:120
1:100
4:59
3:10
If possible, I would like to do this without creating another file, and numbering them by hand is not an option for the data size I'm using.
Given a file test.dat:
100
120
10
59
... the command:
$ cat -n test.dat | sort --key=2 -nr
2 120
1 100
4 59
3 10
... gives the output that you seem to be looking for (though with the fields delimited by tabs, which is easily changed if necessary):
$ cat -n test.dat | sort --key=2 -nr | sed -e 's/^ *//;s/\t/:/'
2:120
1:100
4:59
3:10
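If you want the colon-separated format from the question directly, a similar pipeline with awk supplying the line numbers would work as well (a sketch, using the same test.dat):
$ awk -v OFS=':' '{print NR, $0}' test.dat | sort -t: -k2,2 -nr
2:120
1:100
4:59
3:10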
This question already has answers here:
How can I split a large text file into smaller files with an equal number of lines?
(12 answers)
Closed 6 years ago.
I have a file, say all, with 2000 lines, and I hope it can be split into 4 small files with line number 1~500, 501~1000, 1001~1500, 1501~2000.
Perhaps, I can do this using:
cat all | head -500 >small1
cat all | tail -1500 | head -500 >small2
cat all | tail -1000 | head -500 >small3
cat all | tail -500 >small4
But this way requires calculating line numbers, which can easily go wrong when the line count does not divide evenly, or when we want to split the file into many small files (e.g. a file all with 3241 lines that we want to split into 7 files, each with 463 lines).
Is there a better way to do this?
When you want to split a file, use split:
split -l 500 all all
will split the file into pieces of 500 lines each (the last piece may have fewer). If you want to split the file into 4 files of roughly the same size, use something like:
split -l $(( $( wc -l < all ) / 4 + 1 )) all all
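For the 3241-line, 7-file example from the question, the same arithmetic gives a chunk size of 3241/7 + 1 = 464 lines, so no manual calculation is needed (part is just an illustrative prefix here):
split -l $(( $(wc -l < all) / 7 + 1 )) all part
The first six pieces then get 464 lines each and the last one gets the remaining 457.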
Look into the split command, it should do what you want (and more):
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names.
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes[=FROM] use numeric suffixes instead of alphabetic.
FROM changes the start value (default 0).
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines per output file
-n, --number=CHUNKS generate CHUNKS output files. See below
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
SIZE is an integer and optional unit (example: 10M is 10*1024*1024). Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (powers of 1000).
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines
l/K/N output Kth of N to stdout without splitting lines
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
Like the others have already mentioned, you can use split. The complicated command substitution mentioned in the accepted answer is not necessary. For reference, I'm adding the following commands, which accomplish almost what has been requested. Note that when using the -n command-line option to specify the number of chunks, the small* files do not contain exactly 500 lines.
$ seq 2000 > all
$ split -n l/4 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
583 small1
528 small2
445 small3
444 small4
2000 total
Alternatively, you could use GNU parallel:
$ < all parallel -N500 --pipe --cat cp {} small{#}
$ wc -l small*
500 small1
500 small2
500 small3
500 small4
2000 total
As you can see, this incantation is quite complex. GNU Parallel is actually most often used for parallelizing pipelines. IMHO it's a tool worth looking into.
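For comparison, a slightly simpler variant of the same idea (still assuming GNU parallel; a sketch, not the only way) sends each 500-line block straight to the job's stdin instead of going through --cat's temporary file:
$ < all parallel --pipe -N500 'cat > small{#}'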