Say I have a bunch of XML files which contain no newlines, but basically contain a long list of records, delimited by </record><record>
If the delimiter were </record>\n<record> I would be able to do something like cat *.xml | grep xyz | wc -l to count instances of records of interest, because cat would emit the records one per line.
Is there a way to write SOMETHING *.xml | grep xyz | wc -l where SOMETHING can stream out the records one per line? I tried using awk for this but couldn't find a way to avoid streaming the whole file into memory.
Hopefully the question is clear enough :)
This is a little ugly, but it works:
sed 's|</record>|</record>\
|g' *.xml | grep xyz | wc -l
(Yes, I know I could make it a little bit shorter, but only at the cost of clarity.)
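If your sed is GNU sed, the same substitution can be written without the literal line break, since GNU sed accepts \n in the replacement text (POSIX sed does not guarantee this, hence the form above):
sed 's|</record>|</record>\n|g' *.xml | grep xyz | wc -l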
If the record body contains no character like <, /, or >, then you may try one of these:
grep -E -o 'SEARCH_STRING[^<]*</record>' *.xml| wc -l
or
grep -E -o 'SEARCH_STRING[^/]*/record>' *.xml| wc -l
or
grep -E -o 'SEARCH_STRING[^>]*>' *.xml| wc -l
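For example, here is the first variant on a synthetic one-line stream; since grep -o prints each match on its own line, wc -l then counts matching records:
$ echo '<record>a xyz b</record><record>c</record>' | grep -E -o 'xyz[^<]*</record>'
xyz b</record>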
Here is a different approach using xsltproc, grep, and wc. Warning: I am new to XSL so I can be dangerous :-). Here is my count_records.xsl file:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/> <!-- Output text, not XML -->
  <xsl:template match="record"> <!-- Match each "record" node -->
    <xsl:value-of select="text()"/> <!-- Output: contents of the record node -->
    <xsl:text>&#10;</xsl:text> <!-- Output: a newline (xsl:text may contain only character data, so the comment sits outside it) -->
  </xsl:template>
</xsl:stylesheet>
On my Mac, I found a command-line tool called xsltproc, which reads instructions from an XSL file and processes XML files. So the command would be:
xsltproc count_records.xsl *.xml | grep SEARCH_STRING | wc -l
The xsltproc command prints the text of each record node, one per line
The grep command keeps only the lines you are interested in
Finally, the wc command produces the count
You may also try xmlstarlet for gig-sized files:
# cf. http://niftybits.wordpress.com/2008/03/27/working-with-huge-xml-files-tools-of-the-trade/
xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml |
awk '{n+=$1} END {print n}'
or:
xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml |
paste -s -d '+' /dev/stdin | bc
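One caveat worth knowing: text() only matches text that is a direct child of record. If your records wrap their text in child elements (an assumption about your data, not something stated in the question), matching on the whole node's string value may be closer to what you want:
xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(.),'xyz')])" -n *.xml |
awk '{n+=$1} END {print n}'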
I need to edit a bash script that sorts .vcf files. vcf files are roughly structured as shown below:
## header line
## header line
…
Data line
Data line
…
The script is called vcfsort and is part of a library for manipulating vcf files. It looks like this:
head -1000 $1 | grep "^#"; cat $@ | grep -v "^#" | sort -k1,1d -k2,2n
And it is run by writing vcfsort input.vcf > output.vcf.
I understand roughly what it does: since sorting should only be done on the data lines, it gets the header lines:
head -1000 $1 | grep "^#";
And combines it with sorted data lines:
cat $@ | grep -v "^#" | sort -k1,1d -k2,2n
I need the head command to read more lines. Instead of calling vcfsort like above, I thought I could just edit the script myself and write it out directly as a command like this:
head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf
This does not work as expected. My attempt above writes the correct output to stdout if I leave out > output.vcf. However, if I include it, only the data lines are written to the file and the header lines are written to stdout. So, I have a couple of questions:
In this Stack Overflow answer, it is said that to combine semicolon-separated commands, they should be enclosed in parentheses. Why is that not the case in the vcfsort script?
Why is $@ used in the cat command instead of $1? $@ should refer to all of a shell script's arguments, but since only one is given (the input file), why not just use $1? If there is a reason for this, how can I transfer it to my command-line expression?
Why do I only get part of the stdout when I send it to a file?
Could you show me the edits I need to make to get my command to work as intended?
So the script takes the first 1000 lines of the first file!
It separates out the header: it simply copies any comment lines among those first 1000 lines to the output.
Next, it filters out the comment lines (leaving only data lines) of all files, and sorts them.
So if you use
vcfsort file1 file2 file3
then $1 = "file1", and only the header from file1 will be present in the output,
while $@ refers to all the files: "file1 file2 file3".
If you need to get the headers from all files and merge them, I would recommend using a loop:
for file in "$@"; do
    head -1000 "$file" | grep "^#"
done
cat "$@" | grep -v "^#" | sort -k1,1d -k2,2n
Why do I only get part of the stdout when I send it to a file?
head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf
Each command executes separately (they are divided by the semicolon ";"). So in the example above you are redirecting only the sorted data-line output; the header part is not redirected to the file.
I would recommend removing the redirection from the command and just using:
vcfsort input.vcf > output.vcf
This does not work as expected
May I know what was expected?
There are two command lists, separated by a ;, inside vcfsort:
head -1000 $1 | grep "^#"
cat $@ | grep -v "^#" | sort -k1,1d -k2,2n
Each list is a single pipeline. The final command of each pipeline inherits its standard output from vcfsort, so that when you run
vcfsort input.vcf > output.vcf
both grep and sort write to output.vcf.
The equivalent using braces would be (replacing ; with a newline for readability)
# Quoting the parameter expansions is important, to protect
# against word-splitting and pathname expansion of the original arguments.
{ head -1000 "$1" | grep "^#"
cat "$#" | grep -v "^#" | sort -k1,1d -k2,2n
} > output.vcf
Output redirections apply only to a single command, not a command list. Here, a command group serves as that single command:
the standard output of the command group is output.vcf, and the two lists in the group inherit that just as before.
Your attempt
head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n > output.vcf
only opened output.vcf to use as the standard output for sort; the standard output of grep remains whatever standard output it inherits from its parent, namely your terminal.
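Applying the same grouping to your own one-liner (with your larger head count) gives:
{ head -10000 input.vcf | grep "^#"; cat input.vcf | grep -v "^#" | sort -k1,1d -k2,2n; } > output.vcf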
I found that grep has some internal limit on the number of lines processed.
Is there a way to remove this limit?
$ cat debug-2020-09-14.log | wc -l
5255625
$ cat debug-2020-09-14.log | grep -v "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" | wc -l
3239948
$ cat debug-2020-09-14.log | grep "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" | wc -l
0
I suspect you have binary data in your log file.
Once grep matches a line with binary data in it, grep prints Binary file (standard input) matches (to stdout, not stderr!) and exits. All matches after the binary part will be ignored.
To confirm this theory run
grep . debug-2020-09-14.log | grep -x 'Binary file .* matches'
If that prints the notice, then binary data is indeed the problem, and you can fix it using grep's -a option. Below, cat and wc -l are also replaced by grep's own capabilities (-c counts matching lines):
grep -ac aaaa debug-2020-09-14.log
From man grep:
-a, --text
Process a binary file as if it were text;
this is equivalent to the --binary-files=text option.
--binary-files=TYPE
If a file's data or metadata indicate that the file contains binary data, assume that the file is of type TYPE.
[...] grep suppresses output after null input binary data is discovered [...]. When some output is suppressed, grep follows any output with a one-line message saying that a binary file matches.
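You can reproduce the behavior with a synthetic input; the NUL byte written by printf makes GNU grep classify the stream as binary (output shown for a grep that prints the notice to stdout, as described above; some newer versions route it to stderr instead):
$ printf 'xyz\n\0\nxyz\n' | grep xyz
Binary file (standard input) matches
$ printf 'xyz\n\0\nxyz\n' | grep -a xyz | wc -l
2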
I am using the awk command to fetch values from XML elements, using the command below:
$(awk -F "[><]" '/'$tag_name'/{print $3}' $FILE_NAME | sort | uniq)
Here, FILE_NAME is the XML file and tag_name is the name of the XML element whose value we need.
Sample XML
<item>
<tag1>test</tag1>
<tag2><![CDATA[cdata_test]]></tag2>
</item>
One of the tags in the XML contains CDATA, and for that tag the script is not working as expected: when I try to print the value, it prints a blank.
Instead of using a generic tool such as awk, which is not aware of the specificities of XML, I suggest you use xmlstarlet to select the nodes you want. For instance:
xmlstarlet select -t -v '//tag1' -n input.xml
will give as result:
test
Issuing:
xmlstarlet select -t -v '//tag2' -n input.xml
gives as output:
cdata_test
If you don't need the newline at the end of the returned string, just remove the -n from the options of the xmlstarlet command.
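If you want to keep the shape of the original script (variable tag name, deduplicated output), a sketch along the same lines, reusing the $tag_name and $FILE_NAME variables from the question, would be:
xmlstarlet select -t -v "//$tag_name" -n "$FILE_NAME" | sort -u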
Keep it simple.
As xmlstarlet is not installed on my machine, I used sed prior to my awk command, as follows, and that works for me:
$(sed -e 's/<![CDATA[//g; s/]]>//g' ${FILE_NAME} | awk -F "[><]" '/'$tag_name'/{print $3}' | sort | uniq)
If anybody has any other solution, that is welcome too.
I have a feed.xml file that looks something like this. What I want to do is grab test.html from this feed (basically, the topmost item's url). Any thoughts on how to do this?
<rss>
<item>
<title>ABC</title>
<url>
test.html
</url>
</item>
<item>
<title>CDE</title>
<url>
test1.html
</url>
</item>
</rss>
Thanks!
If the structure is fixed and you know that the URL has the postfix .html, you can simply do:
cat <yourfile> | grep ".html" | head -n1
If you don't know the postfix (or the string "html" may occur earlier in the file), you can do:
cat <yourfile> | grep -A1 "<url>" | head -n2 | tail -n1
EDIT
If the structure is not fixed (i.e., there are no newlines in the file), then this
cat <yourfile> | grep -o "<url>[^<]*</url>" | head -n1 | cut -d'>' -f2 | cut -d'<' -f1
or that
cat <yourfile> | grep -o "<url>[^<]*</url>" | head -n1 | sed -E -e"s#<url>(.*)</url>#\1#"
may work.
This might work for you (it prints the first line between <url> and </url>, stripping leading spaces, then quits):
sed '/<url>/,/<\/url>/{//d;s/ *//;q};d' file.xml
This awk script should work:
awk '/<url>/ && url==0 {url=1;next;} {if(url==1) {print;url=2;}}' file
EDIT:
The following grep command might also work:
grep -m 1 "^ *<url>" -A1 file | grep -v "<url>"
Instead of using line-based tools, I'd suggest using an xsl transform to get the data you want out of the document without making assumptions about the way it's formatted.
If you save this to get-url.xsl:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="normalize-space(rss/item/url)"/>
</xsl:template>
</xsl:stylesheet>
Then you can get the value of url from feed.xml like this:
$ xsltproc get-url.xsl feed.xml; echo
test.html
$
The extra echo is just there to give you a newline after the end of the output, to make it friendly for an interactive shell. Just remove it if you're assigning the result to a shell variable with $().
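For example, capturing the result in a shell variable:
url=$(xsltproc get-url.xsl feed.xml)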
What Linux commands would you use, successively, for a bunch of files, to count the number of lines in each file and write to an output file a line containing part of the corresponding input file's name together with its line count? So, for example, if we were looking at the file LOG_Yellow and it had 28 lines, the output file would have a line like this (Yellow and 28 are tab-separated):
Yellow 28
wc -l [filenames] | grep -v " total$" | sed s/[prefix]//
The wc -l generates the output in almost the right format; grep -v removes the "total" line that wc generates for you; sed strips the junk you don't want from the filenames.
wc -l * | head --lines=-1 > output.txt
produces output like this:
linecount1 filename1
linecount2 filename2
I think you should be able to work from here to extend to your needs.
edit: since I haven't seen the rules for your name extraction, I still leave the full name. However, unlike the other answers I'd prefer to use head rather than grep, which not only should be slightly faster, but also avoids the corner case of filtering out a file actually named total.
edit2 (having read the comments): the following does the whole lot:
wc -l * | head --lines=-1 | sed s/LOG_// | awk '{print $2 "\t" $1}' > output.txt
wc -l * | grep -v " total$"
sends output like:
28 LOG_Yellow
You can reverse the fields if you want (using awk, provided you don't have spaces in your file names):
wc -l * | egrep -v " total$" | sed s/[prefix]// | awk '{print $2 " " $1}'
Short of writing the script for you:
'for' for looping through your files.
'echo -n' for printing the current file name.
'wc -l' for finding out the line count.
And don't forget to redirect ('>' or '>>') your results to your output file, as in the sketch below.
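A minimal sketch along those lines, assuming the LOG_ prefix and tab-separated output from the question (printf is used instead of echo -n so the tab is explicit):
for f in LOG_*; do
    # strip the LOG_ prefix, print a tab, then the line count
    printf '%s\t%s\n' "${f#LOG_}" "$(wc -l < "$f")"
done > output.txt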