How to find the nth multiline block of text using sed - bash

So I have a file which contains blocks that look as follows:
menuentry ... {
....
....
}
....
menuentry ... {
....
....
}
I need to look at the contents of each menu entry in a bash script. I have very limited experience with sed but through a very exhaustive search I was able to construct the following:
cat $file | sed '/^menuentry.*{/!d;x;s/^/x/;/x\{1\}/!{x;d};x;:a;n;/^}/!ba;q'
and I can replace the \{1\} with whatever number I want to get the nth block. This works fine like that, but the problem is I need to iterate through an arbitrary number of times:
numEntries=$( egrep -c "^menuentry.*{" $file )
for i in $( seq 1 $numEntries); do
    i=$( echo $i | tr -d '\n' ) # A google search indicated sed encounters problems
                                # when a substituted variable has a trailing return char
    # Get the nth entry block of text with the sed statement from before,
    # but replace with variable $i
    entry=$( cat $file | sed '/^menuentry.*{/!d;x;s/^/x/;/x\{$i\}/!{x;d};x;:a;n;/^}/!ba;q')
    # Do some stuff with $entry #
done
I've tried every combination of quotes/double quotes and braces around the variable and around the sed statement, and every which way I do it, I get some sort of error. As I said, I don't really know much about sed, and that Frankenstein of a statement is just what I managed to mish-mosh together from various Google searches, so any help or explanation would be much appreciated!
TIA

sed is for simple substitutions on individual lines, that is all, and shell loops to manipulate text are immensely slow and difficult to write robustly. The folks who invented sed and shell also invented awk for tasks like this:
awk -v RS= 'NR==3' file
would print the 3rd blank-line-separated block of text as shown in your question. This would print every block containing the string "foobar":
awk -v RS= '/foobar/' file
and anything else you might want to do is equally trivial.
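For example, the loop from your question could then shrink to something like this (an untested sketch, reusing your egrep count and assuming the blocks are separated by blank lines):
numEntries=$( egrep -c "^menuentry.*{" "$file" )
for i in $( seq 1 "$numEntries" ); do
    entry=$( awk -v RS= -v n="$i" 'NR == n' "$file" )
    # Do some stuff with "$entry" #
done
No quoting gymnastics are needed because the counter is passed to awk with -v instead of being spliced into the script text.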
The above will work efficiently, robustly and portably using any awk in any shell on any UNIX box. For example:
$ cat file
menuentry first {
Now is the Winter
of our discontent
}

menuentry second {
Wee sleekit cowrin
timrous beastie,
oh whit a panic's in
thy breastie.
}

menuentry third {
Twas the best of times
Twas the worst of times
Make up your damn mind
}
$ awk -v RS= 'NR==3' file
menuentry third {
Twas the best of times
Twas the worst of times
Make up your damn mind
}
$ awk -v RS= 'NR==2' file
menuentry second {
Wee sleekit cowrin
timrous beastie,
oh whit a panic's in
thy breastie.
}
$ awk -v RS= '/beastie/' file
menuentry second {
Wee sleekit cowrin
timrous beastie,
oh whit a panic's in
thy breastie.
}
If you find yourself trying to do anything other than s/old/new with sed and/or using sed commands other than s, g and p (with -n) then you are using constructs that became obsolete in the mid-1970s when awk was invented.
If the above doesn't work for you then edit your question to provide more truly representative sample input and expected output.

How about splitting this work up?
First, search for the lines where a menu entry starts (for personal reasons I use grub here, not grub2; adjust to your needs):
entrystarts=($(sed -n '/^menuentry.*/=' /boot/grub/grub.cfg))
Then, in a second step, choose a starting value from the array for the n-th entry, ${entrystarts[$n]}, and proceed from there. AFAIK the end of an entry is easy to detect by its single closing curly brace.
for i in "${entrystarts[@]}"
do
    # your code here, proof of concept (note grub/grub2):
    sed -n "$i,/}/p" /boot/grub/grub.cfg
done
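If you only need the n-th entry rather than all of them, something like this sketch should work (bash arrays are zero-indexed, so the n-th start line is element n-1):
n=2                              # which menuentry you want, 1-based (hypothetical value)
start=${entrystarts[n-1]}
sed -n "${start},/}/p" /boot/grub/grub.cfg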

I see the problem. You are trying to use $i inside single quotes, so it is left as a literal $i. As shown below, close the single quotes around $i so that bash sees it as three strings:
'one string'"$twostrings"'three strings'
#!/bin/bash
#filename:grubtest.sh
numEntries=$( egrep -c "^menuentry.*{" "$1" )
i=0
while [ $i -lt $numEntries ]; do
    i=$(($i+1))
    # Get the nth entry block of text with the sed statement from before,
    entry=$( cat "$1" | sed '/^menuentry.*{/!d;x;s/^/x/;/x\{'"$i"'\}/!{x;d};x;:a;n;/^}/!ba;q')
    # Do some stuff with $entry #
    echo $entry | cut -d\' -f2
done
This returned my grub items when run like so:
./grubtest.sh /etc/grub2.cfg

Related

How can I generate multiple counts from a file without re-reading it multiple times?

I have large files of HTTP access logs and I'm trying to generate hourly counts for a specific query string. Obviously, the correct solution is to dump everything into splunk or graylog or something, but I can't set all that up at the moment for this one-time deal.
The quick-and-dirty is:
for hour in 0{0..9} {10..23}
do
    grep $QUERY $FILE | egrep -c "^\S* $hour:"
    # or, alternately
    # egrep -c "^\S* $hour:.*$QUERY" $FILE
    # not sure which one's better
done
But these files average 15-20M lines, and I really don't want to parse through each file 24 times. It would be far more efficient to parse the file and count each instance of $hour in one go. Is there any way to accomplish this?
You can ask grep to output the matching part of each line with -o and then use uniq -c to count the results:
grep "$QUERY" "$FILE" | grep -o "^\S* [0-2][0-9]:" | sed 's/^\S* //' | uniq -c
The sed command is there to keep only the two digit hour and the colon, which you can also remove with another sed expression if you want.
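For instance, stripping the trailing colon as well could look like this (untested sketch, same GNU tools as above):
grep "$QUERY" "$FILE" | grep -o "^\S* [0-2][0-9]:" | sed 's/^\S* //; s/:$//' | uniq -c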
Caveats: this solution works with GNU grep and GNU sed, and will produce no output, rather than "0", for hours with no log entries. Kudos to @EdMorton for pointing these issues out in the comments, and other issues that were fixed in the answer above.
Assuming the timestamp appears with a space before the 2-digit hour, then a colon after:
gawk -v patt="$QUERY" '
$0 ~ patt && match($0, / ([0-9][0-9]):/, m) {
print > (m[1] "." FILENAME)
}
' "$FILE"
This will create up to 24 files, one per hour that actually appears in matching lines.
Requires GNU awk for the 3-arg form of match()
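A hypothetical follow-up to turn those files into hourly counts (it assumes $FILE is a bare name such as access.log, so the per-hour files land in the current directory; hours with no matching lines simply produce no file):
wc -l [0-2][0-9]."$FILE"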
This is probably what you really need, using GNU awk for the 3rd arg to match() and making assumptions about what your input might look like, what your QUERY variable might contain, and what the output should look like:
awk -v query="$QUERY" '
    match($0, " ([0-9][0-9]):.*"query, a) { cnt[a[1]+0]++ }
    END {
        for (hr=0; hr<=23; hr++) {
            printf "%02d = %d\n", hr, cnt[hr]
        }
    }
' "$FILE"
Don't really use all upper case for non-exported shell variables btw - see Correct Bash and shell script variable capitalization.

fast alternative to grep file multiple times?

I currently use long piped bash commands to extract data from text files like this, where $f is my file:
result=$(grep "entry t $t " $f | cut -d ' ' -f 5,19 | \
sort -nk2 | tail -n 1 | cut -d ' ' -f 1)
I use a script that might do hundreds of similar searches of $f, sorting selected lines in various ways depending on what I'm pulling out. I like one-line bash strings with a bunch of pipes because it's compact and easy, but it can take forever. Can anyone suggest a faster alternative? Maybe something that loads the whole file into memory first?
Thanks
You might get a boost by doing the whole pipe with gawk or another awk that has asorti:
contents="$(cat "$f")"
result="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
This will read "$f" into a variable then we'll use a single awk command (well, gawk anyway) to do all the rest of the work. Here's how that works:
-vpattern="entry t $t": defines an awk variable named pattern that contains the shell variable t
$0 ~ pattern matches the current line against the pattern, if it matches we'll do the part in the braces, otherwise we skip it
matches[$5]=$19 adds an entry to an array (and creates the array if needed) where the key is the 5th field and the value is the 19th
END do the following function after all the input has been processed
asorti(matches,inds) sort the entries of matches such that the inds is an array holding the order of the keys in matches to get the values in sorted order
print inds[1] prints the index in matches (i.e., a $5 from before) associated with the lowest 19th field
<<<"$contents" have awk work on the value in the shell variable contents as though it were a file it was reading
Then you can just update the pattern for each search, without having to read the file from disk each time and without needing so many extra processes for all the pipes.
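Hypothetical usage, with made-up values of t, reusing $contents so the file is read from disk only once:
contents="$(cat "$f")"
for t in 1 2 3; do
    result="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
    echo "t=$t -> $result"
done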
You'll have to benchmark to see if it's really faster or not though, and if performance is important you really should think about moving to a "proper" language instead of shell scripting.
Since you haven't provided sample input/output this is just a guess, and I only post it because there are other answers already posted that you should not use, so this may be what you want instead of that one-liner:
result=$(awk -v t="$t" '
    BEGIN { regexp = "entry t " t " " }
    $0 ~ regexp {
        if ( ($19 > maxKey) || (maxKey == "") ) {
            maxKey = $19
            maxVal = $5
        }
    }
    END { print maxVal }
' "$f")
I suspect your real performance issue, however, isn't that script but that you are running it and maybe others inside a loop that you haven't shown us. If so, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice and post a better example so we can help you.

How to get line WITH tab character using tail and head

I have made a script to practice my Bash, only to realize that this script does not take tabulation into account, which is a problem since it is designed to find and replace a pattern in a Python script (which obviously needs tabulation to work).
Here is my code. Is there a simple way to get around this problem ?
pressure=1
nline=$(cat /myfile.py | wc -l) # find the line length of the file
echo $nline
for ((c=0;c<=${nline};c++))
do
    res=$( tail -n $(($(($nline+1))-$c)) myfile.py | head -n 1 | awk 'gsub("="," ",$1){print $1}' | awk '{print$1}')
    #echo $res
    if [ $res == 'pressure_run' ]
    then
        echo "pressure_run='${pressure}'" >> myfile_mod.py
    else
        echo $( tail -n $(($nline-$c)) myfile.py | head -n 1) >> myfile_mod.py
    fi
done
Basically, it finds the line that has pressure_run=something and replaces it by pressure_run=$pressure. The rest of the file should be untouched. But in this case, all tabulation is deleted.
If you want to just do the replacement as quickly as possible, sed is the way to go as pointed out in shellter's comment:
sed "s/\(pressure_run=\).*/\1$pressure/" myfile.py
For Bash training, as you say, you may want to loop manually over your file. A few remarks for your current version:
Is /myfile.py really in the root directory? Later, you don't refer to it at that location.
cat ... | wc -l is a useless use of cat and better written as wc -l < myfile.py.
Your for loop is executed one more time than you have lines.
To get the next line, you do "show me all lines, but counting from the back, don't show me c lines, and then show me the first line of these". There must be a simpler way, right?
To get what's on the left-hand side of an assignment, you say "in the first space-separated field, replace = with a space, then show me the first space-separated field of the result". There must be a simpler way, right? This is, by the way, where you strip out the leading tabs (your first awk command does it).
To print the unchanged line, you do the same complicated thing as before.
A band-aid solution
A minimal change that would get you the result you want would be to modify the awk command: instead of
awk 'gsub("="," ",$1){print $1}' | awk '{print$1}'
you could use
awk -F '=' '{ print $1 }'
"Fields are separated by =; give me the first one". This preserves leading tabs.
The replacements have to be adjusted a little bit as well; you now want to match something that ends in pressure_run:
if [[ $res == *pressure_run ]]
I've used the more flexible [[ ]] instead of [ ] and added a * to pressure_run (which must not be quoted): "if $res ends in pressure_run, then..."
The replacement has to use $res, which has the proper amount of tabs:
echo "$res='${pressure}'" >> myfile_mod.py
Instead of appending each line each loop (and opening the file each time), you could just redirect output of your whole loop with done > myfile_mod.py.
This prints literally ${pressure} as in your version, because it's single quoted. If you want to replace that by the value of $pressure, you have to remove the single quotes (and the braces aren't needed here, but don't hurt):
echo "$res=$pressure" >> myfile_mod.py
This fixes your example, but it should be pointed out that enumerating lines and then getting one at a time with tail | head is a really bad idea. You traverse the file for every single line twice, it's very error prone and hard to read. (Thanks to tripleee for suggesting to mention this more clearly.)
A proper solution
This all being said, there are preferred ways of doing what you did. You essentially loop over a file, and if a line matches pressure_run=, you want to replace what's on the right-hand side with $pressure (or the value of that variable). Here is how I would do it:
#!/bin/bash
pressure=1
# Regular expression to match lines we want to change
re='^[[:space:]]*pressure_run='
# Read lines from myfile.py
while IFS= read -r line; do
    # If the line matches the regular expression
    if [[ $line =~ $re ]]; then
        # Print what we matched (with whitespace!), then the value of $pressure
        line="${BASH_REMATCH[0]}"$pressure
    fi
    # Print the (potentially modified) line
    echo "$line"
# Read from myfile.py, write to myfile_mod.py
done < myfile.py > myfile_mod.py
For a test file that looks like
blah
test
pressure_run=no_tab
blah
something
	pressure_run=one_tab
		pressure_run=two_tabs
the result is
blah
test
pressure_run=1
blah
something
	pressure_run=1
		pressure_run=1
Recommended reading
How to read a file line-by-line (explains the IFS= and -r business, which is quite essential to preserve whitespace)
BashGuide

Get first N chars and sort them

I have a requirement where I need to fetch the first four characters from each line of a file and sort them.
I tried the below way, but it's not sorting each line:
cut -c1-4 simple_file.txt | sort -n
O/p using above:
appl
bana
uoia
Expected output:
alpp
aabn
aiou
sort isn't the right tool for the job in this case, as it is used to sort lines of input, not the characters within each line.
I know you didn't tag the question with perl but here's one way you could do it:
perl -F'' -lane 'print(join "", sort @F[0..3])' file
This uses the -a switch to auto-split each line of input on the delimiter specified by -F (in this case, an empty string, so each character is its own element in the array @F). It then sorts the first 4 characters of the array using the standard string comparison order. The result is joined together on an empty string.
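For example, fed the four-character strings from the question, it produces the expected output (hypothetical session):
$ printf 'appl\nbana\nuoia\n' | perl -F'' -lane 'print(join "", sort @F[0..3])'
alpp
aabn
aiou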
Try defining two helper functions:
explodeword () {
    test -z "$1" && return
    echo ${1:0:1}
    explodeword ${1:1}
}
sortword () {
    echo $(explodeword $1 | sort) | tr -d ' '
}
Then
cut -c1-4 simple_file.txt | while read -r word; do sortword $word; done
will do what you want.
The sort command is used to sort files line by line, it's not designed to sort the contents of a line. It's not impossible to make sort do what you want, but it would be a bit messy and probably inefficient.
I'd probably do this in Python, but since you might not have Python, here's a short awk command that does what you want.
awk '{split(substr($0,1,4),a,"");n=asort(a);s="";for(i=1;i<=n;i++)s=s a[i];print s}'
Just put the name of the file (or files) that you want to process at the end of the command line.
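For example, against the file name used in the question (asort is a gawk extension, so call gawk explicitly if your default awk lacks it):
gawk '{split(substr($0,1,4),a,"");n=asort(a);s="";for(i=1;i<=n;i++)s=s a[i];print s}' simple_file.txt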
Here's some data I used to test the command:
data
this
is a
simple
test file
a
of
apple
banana
cat
uoiea
bye
And here's the output
hist
ais
imps
estt
a
fo
alpp
aabn
act
eiou
bey
Here's an ugly Python one-liner; it would look a bit nicer as a proper script rather than as a Bash command line:
python -c "import sys;print('\n'.join([''.join(sorted(s[:4])) for s in open(sys.argv[1]).read().splitlines()]))"
In contrast to the awk version, this command can only process a single file, and it reads the whole file into RAM to process it, rather than processing it line by line.

removing leading zeros from IP addresses: converting ipfilter.dat to bluetack.co.uk ipfilter with sed

I had a need to convert uTorrent-style ipfilter.dat into a bluetack-style ipfilter file, and wrote this shell script to achieve this:
#!/bin/bash
# read ipfilter.dat-formatted file line by line
# (example: 000.000.000.000-008.008.003.255,000,Badnet
# - ***here, input file's lines/fields are always the same length***)
# and convert into a bluetack.co.uk-formatted output
# (example: Badnet:0.0.0.0-8.8.3.255
# - fields moved around, leading zeros removed)
while read record
do
    start=`echo ${record:0:15} | awk -F '.' '{for(i=1;i<=NF;i++)$i=$i+0;}1' OFS='.'`
    end=`echo ${record:16:15} | awk -F '.' '{for(i=1;i<=NF;i++)$i=$i+0;}1' OFS='.'`
    echo ${record:36:7}:${start}-${end}
done < $1
However, on a 2000-line input file this script takes on average 10(!) seconds to complete - a mere 200 lines/sec.
I'm sure this same result can be achieved with sed, and sed-version is likely to be much faster.
Is there a sed-guru around to suggest a solution for this kind of fixed-positions replacements?
Feel free to suggest a solution in other languages as well - I would enjoy testing a Python or a C version, for example. A more efficient shell/bash version would be welcome as well.
You could try this.
sed -r 's/^0*([0-9]+)\.0*([0-9]+)\.0*([0-9]+)\.0*([0-9]+)-0*([0-9]+)\.0*([0-9]+)\.0*([0-9]+)\.0*([0-9]+),...,(.*)$/\9:\1.\2.\3.\4-\5.\6.\7.\8/' inputfile
I didn't test the performance but I guess it could be faster than 200 lines/sec.
You will sacrifice performance using the shell's while read loop on a big file. It is empirically proven that tools such as awk/sed (and some languages, e.g. Perl/Python/Ruby) are better at iterating over big files and processing their lines than the shell's while read loop. Moreover, in your script, while iterating over the lines, you are also piping a few calls to awk, which adds extra overhead.
Ruby(1.9+)
$ cat file
000.000.000.000-008.008.003.255,000,Badnet
001.010.110.111-002.020.220.222,111,Badnet
$ ruby -F"," -ane 'puts "#{$F[-1].chomp}:" + $F[0].gsub(/(00|0)([0-9]+)([.-])/,"\\2\\3")' file
Badnet:0.0.0.0-8.8.3.255
Badnet:1.10.110.111-2.20.220.222
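The same one-pass idea written in awk, as an untested sketch; it assumes the fixed layout from the question, with no extra commas or hyphens in the name field:
awk -F '[,-]' '{
    split($1, a, ".")    # octets of the start address
    split($2, b, ".")    # octets of the end address
    # printf with %d drops the leading zeros; $4 is the name field ("Badnet")
    printf "%s:%d.%d.%d.%d-%d.%d.%d.%d\n", $4, a[1], a[2], a[3], a[4], b[1], b[2], b[3], b[4]
}' ipfilter.dat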
I really wanted to get this to work in a single sed command, but I wasn't able to figure it out. Surely this will still be faster than 200 lines/s though.
sed 's/\.0\{1,2\}/\./g' | sed 's/^0\{1,2\}//'
#!/bin/tclsh
#Regsub TCL script to remove the leading zeros from the ip address.
#Author : Shoeb Masood , Bangalore
puts "Enter the ip address"
set ip [gets stdin]
set list_ip [split $ip .]
foreach index $list_ip {
    regsub {^0{1,2}} $index {} index
    lappend list_ip2 $index
}
set list_ip2 [join $list_ip2 "."]
puts $list_ip2
