How to sort groups of lines? - bash

In the following example, there are 3 elements that have to be sorted:
"[aaa]" and the 4 lines (always 4) below it form a single unit.
"[kkk]" and the 4 lines (always 4) below it form a single unit.
"[zzz]" and the 4 lines (always 4) below it form a single unit.
Only groups of lines following this pattern should be sorted; anything before "[aaa]" and after the 4th line of "[zzz]" must be left intact.
from:
This sentence and everything above it should not be sorted.
[zzz]
some
random
text
here
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
And neither should this one and everything below it.
to:
This sentence and everything above it should not be sorted.
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
[zzz]
some
random
text
here
And neither should this one and everything below it.

Maybe not the fastest :) [1] but it will do what you want, I believe:
for line in $(grep -n '^\[.*\]$' sections.txt |
              sort -k2 -t: |
              cut -f1 -d:); do
  tail -n +$line sections.txt | head -n 5
done
Here's a better one:
for pos in $(grep -b '^\[.*\]$' sections.txt |
             sort -k2 -t: |
             cut -f1 -d:); do
  tail -c +$((pos+1)) sections.txt | head -n 5
done
[1] The first one is something like O(N^2) in the number of lines in the file, since it has to read all the way to the section for each section. The second one, which can seek immediately to the right character position, should be closer to O(N log N).
[2] This takes you at your word that there are always exactly five lines in each section (header plus four following), hence head -n 5. However, it would be really easy to replace that with something which read up to but not including the next line starting with a '[', in case that ever turns out to be necessary.
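For instance, a sketch of that variant (same sections.txt as above), with awk standing in for head -n 5:
for pos in $(grep -b '^\[.*\]$' sections.txt |
             sort -k2 -t: |
             cut -f1 -d:); do
  # print the header line, then stop just before the next line starting with '['
  tail -c +$((pos+1)) sections.txt | awk 'NR>1 && /^\[/{exit} {print}'
done
Note that for the last section in file order this keeps reading until the next '[' or end of file, so any trailing text after the final section would still need the prefix/suffix handling shown below.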
Preserving start and end requires a bit more work:
# Find all the sections
mapfile indices < <(grep -b '^\[.*\]$' sections.txt)
# Output the prefix
head -c+${indices[0]%%:*} sections.txt
# Output sections, as above
for pos in $(printf %s "${indices[@]}" |
             sort -k2 -t: |
             cut -f1 -d:); do
  tail -c +$((pos+1)) sections.txt | head -n 5
done
# Output the suffix
tail -c+$((1+${indices[-1]%%:*})) sections.txt | tail -n+6
You might want to make a function out of that, or a script file, changing sections.txt to $1 throughout.
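A minimal sketch of such a function (sort_sections is just a placeholder name; the body is the same code as above with sections.txt replaced by the first argument):
sort_sections() {
  local file=$1
  # Find all the sections
  mapfile indices < <(grep -b '^\[.*\]$' "$file")
  # Output the prefix
  head -c "${indices[0]%%:*}" "$file"
  # Output the sections in sorted order
  for pos in $(printf %s "${indices[@]}" | sort -k2 -t: | cut -f1 -d:); do
    tail -c +$((pos+1)) "$file" | head -n 5
  done
  # Output the suffix
  tail -c +$((1+${indices[-1]%%:*})) "$file" | tail -n +6
}
sort_sections sections.txt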

Assuming that other lines do not contain a [ in them:
header=`grep -n 'This sentence and everything above it should not be sorted.' sortme.txt | cut -d: -f1`
footer=`grep -n 'And neither should this one and everything below it.' sortme.txt | cut -d: -f1`
head -n $header sortme.txt #print header
head -n $(( footer - 1 )) sortme.txt | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
#cat sortme.txt | head -n $(( footer - 1 )) | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
tail -n +$footer sortme.txt #print footer
Serves the purpose.
Note that the main sort work is done by the 4th command only. The other lines are there to preserve the header and footer.
I am also assuming that there are no other lines between the header and the first "[section]".

This might work for you (GNU sed & sort):
sed -i.bak '/^\[/!b;N;N;N;N;s/\n/UnIqUeStRiNg/g;w sort_file' file
sort -o sort_file sort_file
sed -i -e '/^\[/!b;R sort_file' -e 'd' file
sed -i 's/UnIqUeStRiNg/\n/g' file
The sorted result will be in file and the original file in file.bak.
This will present all lines beginning with [, each together with the following 4 lines, in sorted order.
UnIqUeStRiNg can be any unique string not containing a newline, e.g. \x00
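If you want the same thing as a reusable script (a sketch, with the file name passed as $1 in the spirit of the earlier suggestion; the commands themselves are unchanged):
#!/bin/bash
# sketch: same four commands, parameterized on the file name
file=$1
sed -i.bak '/^\[/!b;N;N;N;N;s/\n/UnIqUeStRiNg/g;w sort_file' "$file"
sort -o sort_file sort_file
sed -i -e '/^\[/!b;R sort_file' -e 'd' "$file"
sed -i 's/UnIqUeStRiNg/\n/g' "$file"
rm -f sort_file   # remove the temporary file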

Related

How to use awk to select text from a file starting from a line number until a certain string

I have a file that I want to read starting from a certain line number until a certain string. I already used
awk "NR>=$LINE && NR<=$((LINE + 121)) {print}" db_000022_model1.dlg
to read from a specific line until an incremented line number, but right now I need to make it stop by itself at a certain string in order to be able to use it on other files.
DOCKED: ENDBRANCH 7 22
DOCKED: TORSDOF 3
DOCKED: TER
DOCKED: ENDMDL
I want it to stop after it reaches
DOCKED: ENDMDL
#!/bin/bash
# This script is for extracting the pdb files from a sorted list of scored
# ligands
mkdir top_poses
for d in $(head -20 summary_2.0.sort | cut -d, -f1 | cut -d/ -f1)
do
cd "$d"||continue
# find the cluster with the highest population within the dlg
RUN=$(grep '###*' "$d.dlg" | sort -k10 -r | head -1 | cut -d\| -f3 | sed 's/ //g')
LINE=$(grep -ni "BEGINNING GENETIC ALGORITHM DOCKING $RUN of 100" "$d.dlg" | cut -d: -f1)
echo "$LINE"
# extract the best pose and correct the format
awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
# convert the pdbqt file into pdb
#obabel -ipdbqt $d.pdbqt -opdb -O../top_poses/$d.pdb
cd ..
done
When I try the
awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
Just like that in the shell terminal, it works. But in the script it outputs an empty file.
Depending on your requirements for handling DOCKED: ENDMDL occurring before your target line:
awk -v line="$LINE" 'NR>=line; /DOCKED: ENDMDL/{exit}' db_000022_model1.dlg
or:
awk -v line="$LINE" 'NR>=line{print; if (/DOCKED: ENDMDL/) exit}' db_000022_model1.dlg

Bash - How to count occurrences in a column of a .csv file (without awk)

Recently I've started to learn bash scripting and I'm wondering how I can count occurrences in a column of a .csv file. The file is structured like this:
DAYS,SOMEVALUE,SOMEVALUE
sunday,something,something
monday,something,something
wednesday,something,something
sunday,something,something
monday,something,something
So my question is: how can I count how many times each value of the first column (days) appears? In this case the output must be:
Sunday : 2
Monday : 2
Wednesday: 1
The first column is named DAYS, so the script must not count the single value DAYS; DAYS is just there to identify the column.
If possible I want to see a solution without the awk command and without Python etc.
Thanks, and sorry for my bad English.
Edit: I thought of doing this:
count="$( cat "${FILE}" | grep -c "OCCURENCE")"
echo "OCCURENCE": ${count}
Where OCCURENCE is one of the single values (sunday, monday, ...).
But this solution is not automatic: I need to build a list of the distinct values in the first column of the .csv file, put each one in an array, and then count each one with the code I wrote before. I need some help to do this, thanks.
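A minimal sketch of that idea in plain bash (assuming the file name is in FILE as in the snippet above; the distinct values are gathered into an array first, then each one is counted with grep -c):
# gather the distinct values of the first column, skipping the DAYS header
mapfile -t days < <(cut -d, -f1 "${FILE}" | tail -n +2 | sort -u)
for day in "${days[@]}"; do
  count="$(grep -c "^${day}," "${FILE}")"   # lines starting with "value,"
  echo "${day} : ${count}"
done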
cut -f1 -d, test.csv | tail -n +2 | sort | uniq -c
This gets you this far:
2 monday
2 sunday
1 wednesday
To get your format (Sunday : 1), I think awk would be an easy and clear way (something like awk '{print $2 " : " $1}'), but if you really really must, here's a complete non-awk version:
cut -f1 -d, test.csv | tail -n +2 | sort | uniq -c | while read line; do words=($line); echo "${words[1]} : ${words[0]}"; done
A variation of @sneep's answer that uses sed to format the result:
cut -f1 -d, /tmp/data | tail -n +2 | sort | uniq -c | sed 's|^ *\([0-9]*\) \(.*\)|\u\2: \1|g'
Output:
Monday: 2
Sunday: 2
Wednesday: 1
The sed is matching:
^ *: Beginning of line and then any number of spaces
\([0-9]*\): Any number of numbers (storing them in a group \1)
' ': A single space
\(.*\): Any character until the end, storing it in group \2
And replaces the match with:
\u\2: Second group, capitalizing first character
: \1: Colon, space and the first group

How do I check if the number of lines in a set of files is not equal to a certain number? (Bash)

I want to detect which of my files are corrupt, and by corrupt I mean that the file does not have 102 lines in it. I want the for loop that I'm writing to output an error message giving me the file names of the corrupt files. I have files named ethane1.log ethane2.log ethane3.log ... ethane10201.log.
for j in {1..10201}
do
if [ ! (grep 'C 2- C 5' ethane$j.log | cut -c 22- | tail -n +2 | awk '{for (i=1;i<=NF;i++) print $i}'; done | wc -l) == 102]
then echo "Ethane$j.log is corrupt."
fi
done
When the file is not corrupt, the input:
grep 'C 2- C 5' ethane$j.log | cut -c 22- | tail -n +2 | awk '{for (i=1;i<=NF;i++) print $i}'; done | wc -l
returns:
102
Or else it is another number.
The only thing is, I'm not sure of the syntax for the if construct (how to create a variable from the 102 output of wc -l, and then how to check whether or not it is equal to 102).
A sample output would be:
Ethane100.log is corrupt.
Ethane2010.log is corrupt.
Ethane10201.log is corrupt.
To count lines, use wc -l:
wc -l ethane*.log | grep -v '^ *102 ' | head -n-1
grep -v removes matching lines
^ matches the start of a line
space* matches any number of spaces (0 or more)
head removes some trailing lines
-n-1 removes the last line (the total)
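If you'd rather keep the per-file loop and pipeline from the question, here is a sketch of capturing the count in a variable and comparing it:
for j in {1..10201}; do
  count=$(grep 'C 2- C 5' "ethane$j.log" | cut -c 22- | tail -n +2 \
          | awk '{for (i=1;i<=NF;i++) print $i}' | wc -l)
  if [ "$count" -ne 102 ]; then
    echo "Ethane$j.log is corrupt."
  fi
done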
Using gawk
awk 'ENDFILE{if(FNR!=102)print FNR,FILENAME}' ethane*.log
At the end of each file, checks the number of lines isn't 102 and prints the number of lines and the filename.

randomly selecting rows with header intact

How can I select 500 rows randomly from a text file, but make sure that the header is always included. My file looks like
Col1 Col2
A B
C D
etc. And the first line is the header. I tried sort -r filename | head -n 500 but that does not ensure that the header is always included. Thanks
I'd say
{ IFS= read -r head; echo "$head"; shuf | head -n 500; } < file
Upon further reflection, that may not be the best solution: it shuffles the file, so the randomly selected lines are out of order. This may not matter.
If it does matter, here's a technique:
sed -n "$({ echo 1; seq $(wc -l <file) | sed 1d | shuf | head -n 500 | sort -n; } | sed 's/$/p/')" file
The command substitution prints out a sed program to print 500 random lines from the file, but they are in order:
echo 1 => the header is always included
seq $(wc -l <file) => print the numbers from 1 to the number of lines in the file
sed 1d => delete the first line ("1") - don't want the header twice
shuf => shuffle the line numbers
head -n 500 => take 500 of them
sort -n => sort the numbers numerically
sed 's/$/p/' => add a "p" to the end of each line
Then, the outer sed program does something like
sed -n "1p; 5p; 199p; 201p; ... 4352p" file
Solution:
filename=file.txt
lines=500
head -1 $filename
tail -n+2 $filename | shuf | head -n $((lines-1))
Explanation.
This command prints header only:
head -1 $filename
This command prints everything but header:
tail -n+2 $filename
Since one line (the header) was already printed, there are only 500-1 lines left to be printed:
head -n $((lines-1))
Also, as was mentioned, it's better to use shuf instead of sort -r to shuffle the lines, because sort -r merely reverse-sorts and gives you the same order of lines every time.
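If you need this for more than one file, a small wrapper might help (sample_with_header is a made-up name; it is just the two commands above parameterized):
sample_with_header() {
  local filename=$1 lines=${2:-500}
  head -1 "$filename"                                       # header
  tail -n +2 "$filename" | shuf | head -n $((lines - 1))    # random rows
}
sample_with_header file.txt 500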

Bash: creating a pipeline to list top 100 words

Ok, so I need to create a command that lists the 100 most frequent words in any given file, in a block of text.
What I have at the moment:
$ alias words='tr " " "\012" <hamlet.txt | sort -n | uniq -c | sort -r | head -n 10'
outputs
$ words
14 the
14 of
8 to
7 and
5 To
5 The
5 And
5 a
4 we
4 that
I need it to output in the following format:
the of to and To The And a we that
((On that note, how would I tell it to print the output in all caps?))
And I need to change it so that I can pipe 'words' to any file, so instead of having the file specified within the pipe, the initial input would name the file & the pipe would do the rest.
Okay, taking your points one by one, though not necessarily in order.
You can change words to use standard input just by removing the <hamlet.txt bit since tr will take its input from standard input by default. Then, if you want to process a specific file, use:
cat hamlet.txt | words
or:
words <hamlet.txt
You can remove the effects of capital letters by making the first part of the pipeline:
tr '[A-Z]' '[a-z]'
which will lower-case your input before doing anything else.
Lastly, if you take that entire pipeline (with the suggested modifications above) and then pass it through a few more commands:
| awk '{printf "%s ", $2}END{print ""}'
This prints the second field of each line (the word) followed by a space, then prints an empty string with a terminating newline at the end.
For example, the following script words.sh will give you what you need:
tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort -n | uniq -c | sort -r
| head -n 3 | awk '{printf "%s ", $2}END{print ""}'
(on one line: I've split it for readability) as per the following transcript:
pax> echo One Two two Three three three Four four four four | ./words.sh
four three two
You can achieve the same end with the following alias:
alias words="tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort -n | uniq -c | sort -r
| head -n 3 | awk '{printf \"%s \", \$2}END{print \"\"}'"
(again, one line) but, when things get this complex, I prefer a script, if only to avoid interminable escape characters :-)
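As for the side question about printing the output in all caps, one more tr at the very end should do it (a sketch, reusing the words.sh transcript above):
pax> echo One Two two Three three three Four four four four | ./words.sh | tr '[a-z]' '[A-Z]'
FOUR THREE TWO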
