Grep and output to a file between two known nodes - terminal

I have the following information being logged:
<MessageForTransform>
<node>
<taster></taster>
</node>
</MessageForTransform>
I am trying to figure out the regex to pull all of the information between and including the <MessageForTransform> tags.
Bonus points for using tail -f to watch and output to a file, I think the format is as follows:
$ tail -f logFileToRead.txt >>logFileToWrite.txt | grep "regex"
I was using the following regex
\<MessageForTransform\>[^]\<\/MessageForTransform\>
I'd like to give more information, but this is really as far as I can get without knowing the regex, which I have had a look around for. The above works on a regex checker if I paste the raw XML into the page, but not in the terminal.
Thanks in advance.

Usually... we don't use a regexp to parse an XML file!
But if you are sure that your XML files are formatted in a regular way, with the opening and closing tags you care about on their own lines, then you could awk this:
awk -v delimitertag="MessageForTransform" '
BEGIN { weprint=0 ; sd="<"delimitertag">"; ed="</"delimitertag">"}
( $0 ~ sd ) { weprint=1;}
( weprint == 1 ) { print $0 ;}
( $0 ~ ed ) { weprint=0;}
' your_file_here #... or nothing if you want to parse stdin
I go for readability over terseness (I'm sure you'll get better ways than this one)
It's easy to modify to not print the starting/ending lines, too.
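For example, just swapping the order of the rules drops the delimiter lines themselves (an untested sketch, otherwise the same script):
awk -v delimitertag="MessageForTransform" '
BEGIN { weprint=0 ; sd="<"delimitertag">"; ed="</"delimitertag">"}
( $0 ~ ed ) { weprint=0;}      # clear the flag before the print rule, so the closing tag line is skipped
( weprint == 1 ) { print $0 ;}
( $0 ~ sd ) { weprint=1;}      # set the flag after the print rule, so the opening tag line is skipped
' your_file_here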
And to add a little bit of flexibility, for example: sd="<"delimitertag"[^>]*>"; in case those tags can optionally have some additional attributes embedded.
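As for the tail -f part of the question: the filter has to sit between tail and the redirection, not after it. A rough sketch, reusing the script above with the file names from the question:
tail -f logFileToRead.txt |
awk -v delimitertag="MessageForTransform" '
BEGIN { weprint=0 ; sd="<"delimitertag">"; ed="</"delimitertag">"}
( $0 ~ sd ) { weprint=1;}
( weprint == 1 ) { print $0 ;}
( $0 ~ ed ) { weprint=0;}
' >> logFileToWrite.txt
Note that output going to a file may be block-buffered, so lines can show up in logFileToWrite.txt with a delay; with gawk you can add a call to fflush() after the print if that matters.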

Related

fast alternative to grep file multiple times?

I currently use long piped bash commands to extract data from text files like this, where $f is my file:
result=$(grep "entry t $t " $f | cut -d ' ' -f 5,19 | \
sort -nk2 | tail -n 1 | cut -d ' ' -f 1)
I use a script that might do hundreds of similar searches of $f, sorting selected lines in various ways depending on what I'm pulling out. I like one-line bash strings with a bunch of pipes because it's compact and easy, but it can take forever. Can anyone suggest a faster alternative? Maybe something that loads the whole file into memory first?
Thanks
You might get a boost by doing the whole pipe with gawk, or another awk that has asorti:
contents="$(cat "$f")"
result="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
This will read "$f" into a variable then we'll use a single awk command (well, gawk anyway) to do all the rest of the work. Here's how that works:
-vpattern="entry t $t": defines an awk variable named pattern built from the text "entry t " and the shell variable t
$0 ~ pattern matches the current line against the pattern; if it matches we do the part in the braces, otherwise we skip the line
matches[$5]=$19 adds an entry to an array (and creates the array if needed) where the key is the 5th field and the value is the 19th
END do the following block after all the input has been processed
asorti(matches,inds) sorts the keys of matches and stores them, in sorted order, in the array inds
print inds[1] prints the first of those sorted keys (i.e., a $5 from before)
<<<"$contents" has awk read the value of the shell variable contents as though it were a file
Then you can just update the pattern for each search (see the loop sketch below), without re-reading the file from disk each time and without so many extra processes for all the pipes.
You'll have to benchmark to see if it's really faster or not though, and if performance is important you really should think about moving to a "proper" language instead of shell scripting.
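A sketch of that reuse, looping over some hypothetical $t values with the same awk body as above:
contents="$(cat "$f")"
for t in foo bar baz; do    # hypothetical list of the $t values you search for
    result="$(awk -v pattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
    echo "$t: $result"
done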
Since you haven't provided sample input/output this is just a guess, and I only post it because other answers already posted here suggest approaches you should not use. This may be what you want instead of that one-liner:
result=$(awk -v t="$t" '
BEGIN { regexp = "entry t " t " " }
$0 ~ regexp {
    # field 19 is the sort key and field 5 the value to report, matching the original pipeline
    if ( ($19 > maxKey) || (maxKey == "") ) {
        maxKey = $19
        maxVal = $5
    }
}
END { print maxVal }
' "$f")
I suspect your real performance issue, however, isn't that script but that you are running it and maybe others inside a loop that you haven't shown us. If so, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice and post a better example so we can help you.

How to pull a value from between 2 strings which occur several times in a file

I am trying to pull the value from between 2 strings and line-break each result. I am then hoping to combine this with another value from the same document, pulled the same way. The problem is there are NO line breaks in this file and it is quite large. Here is an example of the file.
<ID>47</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>myhost.domain.local</DNS_NAME> <IP_ADDRESS>10.0.0.1</IP_ADDRESS><ID>60</ID><DATACENTER_ID>36</DATACENTER_ID><DNS_NAME>yourhost.domain.local</DNS_NAME><IP_ADDRESS>10.0.0.2</IP_ADDRESS>
My end result would ideally look something like this.
ID-----DNS_NAME
47-----myhost.domain.local
60-----yourhost.domain.local
My closest attempts so far have been creating variables with grep, but I can't seem to format them into a table. I'm also very new to scripting so forgive my ignorance.
If your grep supports -P (--perl-regexp), then you're free to use the below regex.
$ grep -oP '<ID>\K[^<>]*(?=</ID>)|<DNS_NAME>\K[^<>]*(?=</DNS_NAME>)' file | sed 'N;s/\n/-----/g'
47-----myhost.domain.local
60-----yourhost.domain.local
\K discards the previously matched characters from the final match, so they are not printed.
(?=...) is a positive lookahead assertion, which asserts that the match is followed by the given pattern without consuming any characters.
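The question also mentions combining a second value pulled the same way; the same trick should extend, assuming the fields always appear in ID, DNS_NAME, IP_ADDRESS order (a sketch):
$ grep -oP '<ID>\K[^<>]*(?=</ID>)|<DNS_NAME>\K[^<>]*(?=</DNS_NAME>)|<IP_ADDRESS>\K[^<>]*(?=</IP_ADDRESS>)' file | paste -d '-' - - -
47-myhost.domain.local-10.0.0.1
60-yourhost.domain.local-10.0.0.2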
Here is a gnu awk solution (gnu because RS here is more than one character) to get your data:
awk -v RS="<ID>" -F"<|>" 'NR>1 {print $1"-----"$9}' file
47-----myhost.domain.local
60-----yourhost.domain.local
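In case it helps to see where $1 and $9 come from, here is the same command laid out with comments:
awk -v RS="<ID>" -F"<|>" '
NR > 1 {                       # NR==1 is the empty record before the first <ID>
    # after splitting on < and >, $1 holds the ID value and $9 holds the DNS_NAME value
    print $1 "-----" $9
}' file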

parse word from html file

I am having a lot of trouble trying to extract a word from an html file. The line in the html file appears like this:
<span id="result">WORD</span>
I am trying to get the WORD out but I can't figure it out. So far I've got:
grep 'span id="result"' FILE
Which just gets me the line. I've also tried:
sed -n '/<span id="result">/,/<\/span>/p' FILE
which didn't work either.
I know this is probably a very simple question, but I'm just beginning so I could really use some help.
Do not use regex to parse html.
Use an html parser.
My Xidel has the shortest syntax for this:
xidel FILE -e "#result"
This is a task for awk.
I guess you have other lines in the same file, so a search for span id is a must.
echo '<span id="result">WORD</span>' | awk -F"[<>]" '/span id/ {print $3}'
WORD
You can try
awk -f ext.awk input.html
where input.html is your input html file, and ext.awk is
{
    # collect the whole file into one string so the match can span line breaks
    line = line $0 RS
}
END {
    # gawk's 3-argument match() stores the captured group in a[1]
    match(line, /<span id="result">([^<]*)<\/span>/, a)
    print a[1]
}
This will extract the contents across line breaks.
Use grep with a lookbehind assertion:
grep -Po '(?<=<span id="result">)\w+'
The expression between parentheses is a lookbehind assertion; it is not part of the output but serves as a test for the regex that follows: the pattern actually matched here is only \w+, taken when it is preceded by the expression. Option -o outputs only the match; option -P enables Perl-compatible regular expressions, which provide lookahead and lookbehind assertions.
If you want to modify this regex, please note that with grep -P, a lookbehind must have a fixed length.
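If that fixed-length restriction gets in the way, the \K trick used in an earlier answer on this page avoids it, since whatever comes before \K does not have to be fixed length. A sketch:
grep -Po '<span id="result">\K\w+' FILE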

sed/awk - print text between patterns spanned across multiple lines

I am new to scripting and was trying to learn how to extract any text that exists between two different patterns. However, I am still not able to figure out how to extract text between two patterns in the following scenario:
If I have my input file reading:
Hi I would like
to print text
between these
patterns
and my expected output is like:
I would like
to print text
between these
i.e. my first search pattern is "Hi"; I want to skip this pattern itself but print everything that follows it on the same line. My second search pattern is "patterns" and I would like to completely avoid printing this line or any lines beyond it.
I tried the following:
sed -n '/Hi/,/patterns/p' test.txt
[output]
Hi I would like
to print text
between these
patterns
Next, I tried:
awk '/Hi/ {flag=1;next} /pattern/ {flag=0} flag {print}' test.txt
[output]
to print text
between these
Can someone help me out in identifying how to achieve this?
Thanks in advance
You have the right idea, a mini state machine in awk, but you need some slight mods, as per the following transcript:
pax> echo 'Hi I would like
to print text
between these
patterns ' | awk '
/patterns/ { echo = 0 }
/Hi / { gsub("^.*Hi ", "", $0); echo = 1 }
{ if (echo == 1) { print } }'
Or, in compressed form:
awk '/patterns/{e=0}/Hi /{gsub("^.*Hi ","",$0);e=1}{if(e==1){print}}'
The output of that is:
I would like
to print text
between these
as requested.
The way this works is as follows. The echo variable is initially 0 meaning that no echoing will take place.
Each line is checked in turn. If it contains patterns, echoing is disabled.
If it contains Hi followed by a space, echoing is turned on and gsub is used to modify the line to get rid of everything up to the Hi.
Then, regardless, the line (possibly modified) is echoed when the echo flag is on.
Now, there's going to be edge cases such as:
lines containing two occurrences of Hi; or
lines containing something before the patterns.
You haven't specified how they should be handled so I didn't bother, but the basic concept should be the same.
Updated the solution to remove the line "patterns":
$ sed -n '/^Hi/,/patterns/{s/^Hi //;/^patterns/d;p;}' file
I would like
to print text
between these
This might work for you (GNU sed):
sed '/Hi /!d;s//\n/;s/.*\n//;ta;:a;s/patterns.*$//;tb;$!{n;ba};:b;/^$/d' file
Just set a flag (f) when you find+replace Hi at the start of a line, clear it when you find patterns, then invoke the default print when the flag is set:
$ awk 'sub(/^Hi /,""){f=1} /patterns/{f=0} f' file
I would like
to print text
between these

How to parse a CSV in a Bash script?

I am trying to parse a CSV containing potentially 100k+ lines. Here are the criteria I have:
The index of the identifier
The identifier value
I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).
Any ideas, taking in special consideration for performance?
As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:
$ csvtool -t ',' col "$index" - < csvfile | grep "$value"
According to the docs, it handles escaping, quoting, etc.
See this youtube video: BASH scripting lesson 10 working with CSV files
CSV file:
Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME
Bash script:
#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read user job uid location
do
echo -e "$user \
======================\n\
Role :\t $job\n\
ID :\t $uid\n\
SITE :\t $location\n"
done < $1
IFS=$OLDIFS
Output:
Bob Brown ======================
Role : Manager
ID : 16581
SITE : Main
Sally Seaforth ======================
Role : Director
ID : 4678
SITE : HOME
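A sketch adapting the same read loop to the original question (print the whole lines whose given 1-based column equals a given value), assuming plain comma-separated data with no quoted commas; the column number and value here are hypothetical:
index=3        # hypothetical 1-based column number
value=16581    # hypothetical value to look for
while IFS= read -r line; do
    IFS=',' read -r -a cols <<< "$line"
    [[ "${cols[index-1]}" == "$value" ]] && printf '%s\n' "$line"
done < inputfile.csv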
First prototype using plain old grep and cut:
grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"
If that's fast enough and gives the proper output, you're done.
CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.
So if your data are restricted enough that you can get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV 'properly', bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with its csv.reader.
In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:
Name,Phone
"Woo, John",425-555-1212
You really need a library package that offers robust CSV support instead of relying on comma as a field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language so that is what I use. Here is a simple Tcl script which does what you are asking for:
#!/usr/bin/env tclsh
package require csv
package require Tclx
# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue
# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1
for_file line $fileName {
    set columns [csv::split $line]
    set columnValue [lindex $columns $columnNumber]
    if {$columnValue == $expectedValue} {
        puts $line
    }
}
Save this script to a file called csv.tcl and invoke it as:
$ tclsh csv.tcl filename indexNumber expectedValue
Explanation
The script reads the CSV file line by line and stores each line in the variable $line, then splits it into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If there is a match, it prints out the original line.
Using awk:
export INDEX=2
export VALUE=bar
awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv
Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch:
awk -F, -v idx="$INDEX" -v val="$VALUE" '$idx == val {print}' inputfile.csv
(index can't be used as an awk variable name because it is a built-in awk function, hence idx and val here.)
Jeez...with variables, and everything, awk is almost a real programming language...
For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.
If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):
csvquote inputfile.csv | awk -F, -v idx="$INDEX" -v val="$VALUE" '$idx == val {print}' | csvquote -u
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file
I was looking for an elegant solution that supports quoting and wouldn't require installing anything fancy on my VMware vMA appliance. Turns out this simple Python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV.)
#!/usr/bin/env python
import sys, csv
with sys.stdin as f:
reader = csv.reader(f)
for row in reader:
for col in row:
print col+'\t',
print
Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:
> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,
In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
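For example, a sketch of that line-by-line approach, reusing csv2tsv.py and cut as above:
esxcli -h "$VI_HOST" --formatter=csv network vswitch standard list | csv2tsv.py |
while IFS= read -r line; do
    uplinks=$(cut -f12 <<< "$line")    # 12th tab-separated field, as in the cut -f12 example above
    echo "uplinks: $uplinks"
done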
Parsing CSV with primitive text-processing tools will fail on many types of CSV input.
xsv is a lovely and fast tool for doing this properly. To search for all records that contain the string "foo" in the third column:
cat file.csv | xsv search -s 3 foo
A sed or awk solution would probably be shorter, but here's one for Perl:
perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"'
where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
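For example, with a hypothetical index and value (third column, so 0-based index 2, equal to 16581):
perl -F/,/ -ane 'print if $F[2] eq "16581"' inputfile.csv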
Awk (gawk) actually provides extensions, one of which is CSV processing.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print $0}' test.csv
The output is:
"James T. Kirk",123
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it runs { print $0 }, i.e. prints the full line as requested.
