randomly selecting rows with header intact - bash

How can I select 500 rows randomly from a text file, but make sure that the header is always included? My file looks like
Col1 Col2
A B
C D
etc. And the first line is the header. I tried sort -r filename|head -n 500 but that does not ensure that the header is always included. Thanks

I'd say
{ IFS= read -r head; echo "$head"; shuf | head -n 500; } < file
Upon further reflection, that may not be the best solution: it shuffles the file, so the randomly selected lines are out of order. This may not matter.
If it does matter, here's a technique:
sed -n "$({ echo 1; seq $(wc -l <file) | sed 1d | shuf | head -n 500 | sort -n; } | sed 's/$/p/')" file
The command substitution prints out a sed program that selects the header plus 500 random lines from the file, keeping them in their original order:
echo 1 => the header is always included
seq $(wc -l <file) => print the numbers from 1 to the number of lines in the file
sed 1d => delete the first line ("1") - don't want the header twice
shuf => shuffle the line numbers
head -n 500 => take 500 of them
sort -n => sort the numbers numerically
sed 's/$/p/' => add a "p" to the end of each line
Then, the outer sed program does something like
sed -n "1p; 5p; 199p; 201p; ... 4352p" file

Solution:
filename=file.txt
lines=500
head -1 $filename
tail -n+2 $filename | shuf | head -n $((lines-1))
Explanation.
This command prints header only:
head -1 $filename
This command prints everything but header:
tail -n+2 $filename
Since one line (the header) was already printed, there are only 500-1 = 499 lines left to print:
head -n $((lines-1))
Also, as was mentioned, it's better to use shuf instead of sort -r to shuffle the lines, because sort -r gives you the same order of lines every time.
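Putting the two pieces together as a small script (a sketch: shuf -n is the GNU shuf option for taking a random sample, and sample.txt is just an illustrative output name):
#!/bin/bash
filename=file.txt
lines=500
{
  head -n 1 "$filename"                              # header first
  tail -n +2 "$filename" | shuf -n "$((lines - 1))"  # then lines-1 random body lines
} > sample.txt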

Related

Get Average of Found Numbers in Each File to Two Decimal Places

I have a script that searches through all files in the directory and pulls the number next to the word <Overall>. I now want to get the average of the numbers from each file, and output the filename next to the average, to two decimal places. I've gotten most of it to work except displaying the average. I should say I think it works, but I'm not sure if it's pulling all of the instances in the file, and I'm definitely not sure if it's finding the average; it's hard to tell without the precision. I'm also sorting by the average at the end. I'm trying to use awk and bc to get the average; there's probably a better method.
What I have now:
path="/home/Downloads/scores/*"
(for i in $path
do
echo `basename $i .dat` `grep '<Overall>' < $i |
head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc`
done) | sort -g -k 2
The output I get is:
John 4
Lucy 4
Matt 5
Sara 5
But it shouldn't be an integer and it should be to two decimal places.
Additionally, the files I'm searching through look like this:
<Student>John
<Math>2
<English>3
<Overall>5
<Student>Richard
<Math>2
<English>2
<Overall>4
In general, your script does not extract all numbers from each file, but only the first digit of the first number. Consider the following file:
<Overall>123 ...
<Overall>4 <Overall>56 ...
<Overall>7.89 ...
<Overall> 0 ...
The command grep '<Overall>' | head -c 10 | tail -c 1 will only extract 1.
To extract all numbers preceded by <Overall> you can use grep -Eo '<Overall> *[0-9.]*' | grep -o '[0-9.]*' or (depending on your version) grep -Po '<Overall>\s*\K[0-9.]*'.
To compute the average of these numbers you can use your awk command or specialized tools like ... | average (from the package num-utils) or ... | datamash mean 1.
To print numbers with two decimal places (that is 1.00 instead of 1 and 2.35 instead of 2.34567) you can use printf.
#! /bin/bash
path=/home/Downloads/scores/
for i in "$path"/*; do
    avg=$(grep -Eo '<Overall> *[0-9.]*' "$i" | grep -o '[0-9.]*' |
          awk '{total += $1} END {print total/NR}')
    printf '%s %.2f\n' "$(basename "$i" .dat)" "$avg"
done |
    sort -g -k 2
Sorting works only if file names are free of whitespace (like space, tab, newline).
Note that you can swap out the two lines after avg=$( with any method mentioned above.
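For example (a sketch assuming GNU grep for -P and that datamash is installed), the extraction and averaging could be replaced by:
avg=$(grep -Po '<Overall>\s*\K[0-9.]*' "$i" | datamash mean 1)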
You can use a sed command and retrieve the values to calculate their average with bc:
# Read stdin into an array, then compute the average with bc (IFS=+ joins the values with "+")
function avg() { local IFS=+; mapfile -t l ; bc <<< "scale=2; (${l[*]})/${#l[@]}" ; }
# Browse the .dat files, then display for each file the average
find . -iname "*.dat" |
while read f
do
f=${f##*/} # Remove the dirname
# Echoes the file basename and a tabulation (no newline)
echo -en "${f%.dat}\t"
# Retrieves all the "Overall" values and passes them to our avg function
sed -nE 's/<Overall>([0-9]+)/\1/p' "$f" | avg
done
Output example:
score-2 1.33
score-3 1.33
score-4 1.66
score-5 .66
The pipeline head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc needs improvement.
head -c 10 | tail -c 1 leaves only the 10th character of the first Overall line from each file; better drop that.
Instead, use awk to "remove" the prefix <Overall> and extract the number; we can do this by using <Overall> for the input field separator.
Also use awk to format the result to two decimal places.
Since awk did the job, there's no more need for bc; drop it.
The above pipeline becomes awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'.
Don't forget to keep the closing backtick (`) after it.
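Put into the original loop (a sketch that keeps the question's structure and backtick substitution), that becomes:
path="/home/Downloads/scores/*"
(for i in $path
do
echo `basename $i .dat` `grep '<Overall>' < $i |
awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'`
done) | sort -g -k 2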

How do I check if the number of lines in a set of files is not equal to a certain number? (Bash)

I want to detect which of my files are corrupt, and by corrupt I mean that the file does not have 102 lines in it. I want the for loop that I'm writing to output an error message giving me the file name of each corrupt file. I have files named ethane1.log ethane2.log ethane3.log ... ethane10201.log.
for j in {1..10201}
do
if [ ! (grep 'C 2- C 5' ethane$j.log | cut -c 22- | tail -n +2 | awk '{for (i=1;i<=NF;i++) print $i}'; done | wc -l) == 102]
then echo "Ethane$j.log is corrupt."
fi
done
When the file is not corrupt, the pipeline:
grep 'C 2- C 5' ethane$j.log | cut -c 22- | tail -n +2 | awk '{for (i=1;i<=NF;i++) print $i}' | wc -l
returns:
102
Or else it is another number.
The only thing is, I'm not sure of the syntax for the if construct (how to capture the output of wc -l in a variable, and then check whether it is equal to 102 or not).
A sample output would be:
Ethane100.log is corrupt.
Ethane2010.log is corrupt.
Ethane10201.log is corrupt.
To count lines, use wc -l:
wc -l ethane*.log | grep -v '^ *102 ' | head -n-1
grep -v removes matching lines
^ matches the start of a line
space* matches any number of spaces (0 or more)
head removes some trailing lines
-n-1 removes the last line (the total)
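For context, wc -l with several file arguments prints one count per file plus a final total line, which is why head -n-1 drops the last line; a hypothetical run (the counts are made up for illustration):
$ wc -l ethane1.log ethane2.log ethane3.log
 102 ethane1.log
  97 ethane2.log
 102 ethane3.log
 301 total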
Using gawk
awk 'ENDFILE{if(FNR!=102)print FNR,FILENAME}' ethane*.log
At the end of each file, this checks whether that file's line count (FNR) is not 102 and, if so, prints the number of lines and the filename. (FNR is used rather than NR because NR keeps counting across files.)
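If you prefer to keep the asker's loop-and-if structure, a minimal sketch that captures the count in a variable (reusing the pipeline from the question) could look like:
for j in {1..10201}
do
n=$(grep 'C 2- C 5' "ethane$j.log" | cut -c 22- | tail -n +2 |
    awk '{for (i=1;i<=NF;i++) print $i}' | wc -l)
if [ "$n" -ne 102 ]
then echo "Ethane$j.log is corrupt."
fi
done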

Simple diff/patch script for sorted unique file

How could I write a simple diff and patch script for applying additions and deletions to a list of lines in a file?
This could be an original file (it is sorted and each line is unique):
a
b
d
a simple patch file could look like this (or something similarly simple):
+ c
+ e
- b
The resulting file should look like this (or in any other order, since sort could be applied anyway):
a
c
d
e
The normal patch formats cannot be used since they include context, which might change in this case.
Bash alternatives that read input files only once:
To generate patch you can:
comm -3 a.txt b.txt | sed 's/^\t/+ /;t;s/^/- /'
Because comm separates output from the different files with a tab, we can use that tab to detect whether a line should be added or removed.
To apply patch you can:
{ <patch.txt tee >(grep '^+ ' | cut -c3- >&5) |
grep '^- ' | cut -c3- | comm -13 - a.txt; } 5> >(cat)
The tee splits the input (the patch file) into two streams. The first stream keeps the lines starting with +, strips the prefix, and writes them to file descriptor 5; descriptor 5 is redirected to >(cat), so those lines simply go to stdout. The second stream keeps the lines starting with -, strips the prefix, and runs them through comm -13 against a.txt, which outputs the lines of a.txt that are not in the deletion list. Because the output should be line buffered, the two streams should interleave without garbling.
A shell solution using comm, awk, and grep to apply such a patch would be:
A=a.txt B=b.txt P=patch.txt; { grep '^-' $P | cut -c 3- | comm -23 $A - ; grep '^+' $P | cut -c 3- ; } | sort -u > $B
to generate the patch file would be:
A=a.txt B=b.txt P=patch.txt; { comm -13 $A $B | awk '{print "+ " $0}' ; comm -23 $A $B | awk '{print "- " $0}' ; } > $P
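For illustration, a hypothetical run with the sample data from the question (a.txt is the original list, b.txt the desired result):
$ printf '%s\n' a b d > a.txt
$ printf '%s\n' a c d e > b.txt
$ A=a.txt B=b.txt P=patch.txt
$ { comm -13 $A $B | awk '{print "+ " $0}' ; comm -23 $A $B | awk '{print "- " $0}' ; } > $P
$ cat $P
+ c
+ e
- b
Applying it back onto a.txt (writing to new.txt, an illustrative name) reproduces the target list:
$ A=a.txt B=new.txt P=patch.txt
$ { grep '^-' $P | cut -c 3- | comm -23 $A - ; grep '^+' $P | cut -c 3- ; } | sort -u > $B
$ cat new.txt
a
c
d
e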
Since nobody could give me an answer, I've created a small Python script which does exactly this job: https://github.com/white-gecko/simplepatch
To apply such a patch call it with (where outfile.txt is generated)
./simplepatch.py -m patch -i infile.txt -p patchfile.txt -o outfile.txt
To generate a patch/diff call it with (where patchfile.txt is generated)
./simplepatch.py -m diff -i infile.txt -o outfile.txt -p patchfile.txt

How to sort groups of lines?

In the following example, there are 3 elements that have to be sorted:
"[aaa]" and the 4 lines (always 4) below it form a single unit.
"[kkk]" and the 4 lines (always 4) below it form a single unit.
"[zzz]" and the 4 lines (always 4) below it form a single unit.
Only groups of lines following this pattern should be sorted; anything before "[aaa]" and after the 4th line of "[zzz]" must be left intact.
from:
This sentence and everything above it should not be sorted.
[zzz]
some
random
text
here
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
And neither should this one and everything below it.
to:
This sentence and everything above it should not be sorted.
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
[zzz]
some
random
text
here
And neither should this one and everything below it.
Maybe not the fastest :) [1] but it will do what you want, I believe:
for line in $(grep -n '^\[.*\]$' sections.txt |
sort -k2 -t: |
cut -f1 -d:); do
tail -n +$line sections.txt | head -n 5
done
Here's a better one:
for pos in $(grep -b '^\[.*\]$' sections.txt |
sort -k2 -t: |
cut -f1 -d:); do
tail -c +$((pos+1)) sections.txt | head -n 5
done
[1] The first one is something like O(N^2) in the number of lines in the file, since it has to read all the way to the section for each section. The second one, which can seek immediately to the right character position, should be closer to O(N log N).
[2] This takes you at your word that there are always exactly five lines in each section (header plus four following), hence head -n 5. However, it would be really easy to replace that with something which read up to but not including the next line starting with a '[', in case that ever turns out to be necessary.
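A sketch of that variant, replacing head -n 5 inside the loop (it assumes nothing follows the last section, since each group is read until the next line starting with [ or end of file):
tail -c +$((pos+1)) sections.txt | awk 'NR > 1 && /^\[/ {exit} {print}'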
Preserving start and end requires a bit more work:
# Find all the sections
mapfile indices < <(grep -b '^\[.*\]$' sections.txt)
# Output the prefix
head -c "${indices[0]%%:*}" sections.txt
# Output sections, as above
for pos in $(printf %s "${indices[@]}" |
sort -k2 -t: |
cut -f1 -d:); do
tail -c +$((pos+1)) sections.txt | head -n 5
done
# Output the suffix
tail -c+$((1+${indices[-1]%%:*})) sections.txt | tail -n+6
You might want to make a function out of that, or a script file, changing sections.txt to $1 throughout.
Assuming that other lines do not contain a [ in them:
header=`grep -n 'This sentence and everything above it should not be sorted.' sortme.txt | cut -d: -f1`
footer=`grep -n 'And neither should this one and everything below it.' sortme.txt | cut -d: -f1`
head -n $header sortme.txt #print header
head -n $(( footer - 1 )) sortme.txt | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
#cat sortme.txt | head -n $(( footer - 1 )) | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
tail -n +$footer sortme.txt #print footer
Serves the purpose.
Note that the main sorting work is done by the 4th command only; the other lines are there to preserve the header & footer.
I am also assuming that there are no other lines between the header and the first "[section]".
This might work for you (GNU sed & sort):
sed -i.bak '/^\[/!b;N;N;N;N;s/\n/UnIqUeStRiNg/g;w sort_file' file
sort -o sort_file sort_file
sed -i -e '/^\[/!b;R sort_file' -e 'd' file
sed -i 's/UnIqUeStRiNg/\n/g' file
The sorted file will be in file and the original in file.bak.
This sorts each line beginning with [ together with the 4 lines following it, as a single unit.
UnIqUeStRiNg can be any unique string not containing a newline, e.g. \x00

What's an easy way to trim N lines from the tail of a file (without the use of 'head')?

Say I've got a bunch of files that are all >100 lines long. I'd like to trim off the top 14 lines and the bottom 9 lines, leaving only the lines in the middle. This command will trim off the top fourteen:
cat myfile.txt | tail -n +15
Is there another command I can pipe through to trim off the bottom 9 without explicitly passing the length of the file?
Edited to add: My version of head (Mac OS 10.5) doesn't accept a negative number of lines as a parameter.
This will work on OS X and might be a bit more easily understandable than the sed example:
< myfile.txt tail -n +15 | tail -r | tail -n +10 | tail -r
Of course, if you can get your hands on GNU's version of head, it can be done even more elegantly:
< myfile.txt tail -n +15 | head -n -9
Be aware that tail -n +N starts output at the Nth line, while head -n -N drops the last N lines of its input.
You could use sed:
sed -n -e :a -e '1,9!{P;N;D;};N;ba' myfile.txt
You can also use sed to delete the first 14 lines (the equivalent of tail -n +15):
sed '1,14d' myfile.txt
Use a negative number of lines with the head command:
cat myfile.txt | head -n -9
That prints everything except the last 9 lines.
What jbourque said is completely right. He just wasn't too wordy about it:
cat myfile.txt | tail -n +15 | head -n -9
If you can recognize the last 9 lines by a distinctive pattern for the first of those lines, then a simple sed command would do the trick:
sed -e '1,14d' -e '/distinctive-pattern/,$d' $file
If you need a pure numeric offset from the bottom, standard (as opposed to GNU) sed won't help, but ed would:
ed $file <<'!'
1,14d
$-8,$d
w
q
!
This overwrites the original files. You'd have to script where the file is written to if you wanted to avoid that.
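For example (a sketch; trimmed.txt is just an illustrative output name), you can write the edited buffer to a different file and quit without saving the original:
ed myfile.txt <<'!'
1,14d
$-8,$d
w trimmed.txt
Q
!
Here Q quits unconditionally, so myfile.txt itself is left untouched.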
The head command should do what you want. It works just like tail, but from the other end of the file.
This should also work, and does things in a single process:
seq 15 |
awk -v N=5 '
    NR > N { print lines[NR % N] }   # print the line that fell out of the N-line window
    { lines[NR % N] = $0 }           # remember the current line
'
(The seq is just an easy way to produce some test data.)
