Getting information from XML document and comparing it - bash

I am trying to get information from an XML document and compare it to information I am getting from a text document. But whenever I compare the number I receive from the XML to the number from the text document, the script always tells me they are not equal, even when they should be.
I am using xmllint to get the information from the XML document and read to get the number from the text document. Then I try to compare them and, if they are the same, do something. This is the point where I'm stuck.
input_3="/Users/unix/Desktop/text.txt"
VAR_4= xmllint --xpath "string(//number)" /Users/unix/Desktop/01/testxml.xml
while IFS= read -r line
do
if [[ "$VAR_4" == "$line" ]]
then echo "YEAH"
else echo "Why"
fi
done < "$input_3"
With this code, I always end up in the else branch even though the numbers should be the same. I have been using echo to check that the numbers are the same, and the only thing I could think of is that it has something to do with newlines or spaces: either xmllint or read with IFS is putting a newline behind the number, and that's why the script doesn't consider them the same. For example, in my text document the number is 2, but I have to put a newline behind it so that read gets the number, and in the XML the number is 2 too. I am hoping someone can give me a clue about how to change the format of the outputs I am getting, or how to get the outputs on the same level.
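One likely cause, independent of any whitespace: `VAR_4= xmllint ...` (with a space after the `=`) does not store the output. It runs xmllint with VAR_4 set to an empty string for that one command and prints the result to the terminal, so VAR_4 is empty when the comparison runs. Command substitution, `$(...)`, captures the output and also strips trailing newlines. Below is a minimal sketch; `get_number` is a stand-in for the real xmllint call so that the example runs anywhere, `trim` is a hypothetical helper, and the sample file with Windows line endings is an assumption made only to show the whitespace case.

```shell
#!/usr/bin/env bash

# Stand-in for the real call, so the sketch runs without xmllint:
#   xmllint --xpath "string(//number)" /Users/unix/Desktop/01/testxml.xml
get_number() { printf ' 2 \n'; }

# trim VALUE: hypothetical helper that removes carriage returns and
# leading/trailing whitespace from VALUE.
trim() {
    local s=${1//$'\r'/}
    s=${s#"${s%%[![:space:]]*}"}
    s=${s%"${s##*[![:space:]]}"}
    printf '%s' "$s"
}

# $(...) captures the command's output; "VAR_4= command" does not.
VAR_4=$(trim "$(get_number)")

# Sample text file (an assumption): the number followed by CRLF.
input_3=$(mktemp)
printf '2\r\n' > "$input_3"

# "|| [[ -n $line ]]" also handles a last line without a trailing newline.
while IFS= read -r line || [[ -n $line ]]; do
    if [[ "$VAR_4" == "$(trim "$line")" ]]; then
        echo "YEAH"
    else
        echo "Why"
    fi
done < "$input_3"

rm -f "$input_3"
```

With the paths from the question, the capture line would be `VAR_4=$(xmllint --xpath "string(//number)" /Users/unix/Desktop/01/testxml.xml)`. Printing the values with `printf '%q\n' "$VAR_4" "$line"` makes hidden characters such as `\r` visible.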

Related

Bash: count how many times a word is contained in all the files of a given folder

I'm just trying to count the occurrences of a word without writing an iteration file by file. I don't mind which kind of file it is. The closest I got is:
COUNT=$(grep -r -n -i "theWordImSearchingFor" .)
echo $COUNT
I thought about splitting that by spaces, but the problem is that the output contains not just the filename and the line number but also the content (and that may have tons of spaces), e.g. I got:
./doc1.txt:29: This is the content containing theWordImSearchingFor but also other stuff
./doc1.txt:43: This is another line containing theWordImSearchingFor
./dir123/doc2.txt:339: .This is another...file...theWordImSearchingFor....
Any idea on how to keep it simple? TIA
To count the occurrences of a specific word, you can keep the same approach but simplify it. There are many ways to do this; two of them are much simpler than the command you have listed here.
The two much simpler versions:
1st way
2nd way
Both should work, unless there is a problem with the package installation.
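As an illustration only (not necessarily either of the two ways meant above), one common approach is to let grep print every match on its own line with `-o` and count the lines with wc. The function name `count_word` and the sample files are hypothetical.

```shell
#!/usr/bin/env bash

# count_word WORD [DIR]: count every occurrence of WORD (case-insensitive)
# in all files under DIR (default: current directory).
# -o prints each match on its own line, so a line with two hits counts twice.
count_word() {
    grep -r -o -i -- "$1" "${2:-.}" | wc -l | tr -d ' '
}

# Hypothetical sample tree.
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'foo and Foo\nbar\n' > "$dir/a.txt"
printf 'xfoox\n' > "$dir/sub/b.txt"

count_word foo "$dir"    # prints 3

rm -rf "$dir"
```

This counts substring matches too (`xfoox` above); adding `-w` to grep restricts it to whole words. `grep -c` would count matching lines per file instead of occurrences.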

Scraping specific hyperlinks from a website using bash

I have a website containing several dozen hyperlinks in the following format :
<a href=/news/detail/1/hyperlink>textvalue</a>
I want to get all hyperlinks, and their text values, where the hyperlink begins with /news/detail/1/.
The output should be in the following format :
textvalue
/news/detail/1/hyperlink
First of all, people are going to come in here (possibly talking about someone named Cthulhu) and tell you that awk/regex are not HTML parsers. And they are right, and you should give some thought to what they say. Realistically, you can very often get away with something like this:
sed -n 's/^.*<a\s\+href\=\([^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
This tells sed to read the file input_file.html, find lines that match the regex, replace them with the sections you specified for the output, and discard everything else. The result will print to the terminal.
This also assumes that the file is formatted such that each instance of <a href=/news/detail/1/hyperlink>textvalue</a> is on a separate line. The regex could easily be modified to accommodate different formatting, if needed.
If all of the links you want happen to start with /news/detail/1/, this will probably work:
sed -n 's/^.*<a\s\+href\=\(\/news\/detail\/1\/[^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
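A small runnable sketch of the same idea, assuming GNU sed (`\s`, `\+` and `\n` in the replacement are GNU extensions). The sample HTML and the function name `extract_links` are made up for illustration. Using `|` as the delimiter of the s command avoids escaping every `/` in the path.

```shell
#!/usr/bin/env bash

# extract_links [FILE...]: print the text value and then the link of every
# line containing <a href=/news/detail/1/...>text</a>. Reads stdin if no
# file is given. Like the commands above, it handles one link per line.
extract_links() {
    sed -n 's|^.*<a\s\+href=\(/news/detail/1/[^>]\+\)>\([^<]\+\)</a>.*$|\2\n\1|p' "$@"
}

# Hypothetical sample input; the real page is not shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<li><a href=/news/detail/1/first-story>First story</a></li>
<li><a href=/about/contact>Contact</a></li>
<li><a href=/news/detail/1/second-story>Second story</a></li>
EOF

extract_links "$sample"
rm -f "$sample"
```

For this sample it prints "First story", "/news/detail/1/first-story", "Second story" and "/news/detail/1/second-story", each on its own line; the /about/contact link is skipped.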

Use bash to extract data between two regular expressions while keeping the formatting

I have a question about a small piece of code using the awk command. I have not found an answer/solution anywhere.
I am trying to parse an output file and extract all data between the 1st expression (including) ATOMIC and 2nd expression (excluding) Bond. This data is to be sent to a new file $1_geom. So far I have the following:
`awk '/ATOMIC/{flag=1;next}/Bond lengths in Bohr/{flag=0}flag' $1` >> $1_geom
This script will extract the correct data for me, but there are 2 problems:
The line ATOMIC is not extracted with the data.
The data is extracted and appended to a single line. I want the data to retain the formatting from the parsed file (5 columns, variable number of lines). Please see the attachment for a visual example. Is there a different way to append data (other than >>) so that I can keep the formatting?
Any help is appreciated, thank you.
The next is causing the first match to be skipped; take it out if you don't want that.
The backticks by themselves are a shell syntax error (unless your Awk script happens to produce valid shell commands). I'm guessing you have a useless echo or something like that in your actual script which disarms the error, but instead produces the symptoms you describe.
This was part of a code in a csh script and I did have an "echo" in front of this line. Removing the "echo" makes it work perfectly and addresses the 2 questions that I had.
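Putting both fixes together, a sketch of the corrected command might look like this. The function name `extract_geom` and the sample lines are hypothetical. Without `next`, the ATOMIC line itself is printed, and without the backticks and echo the shell no longer word-splits the output, which is what joined everything into a single line.

```shell
#!/usr/bin/env bash

# extract_geom [FILE...]: print every line from the one matching /ATOMIC/
# up to, but not including, the one matching /Bond lengths in Bohr/.
extract_geom() {
    awk '/ATOMIC/ {flag=1} /Bond lengths in Bohr/ {flag=0} flag' "$@"
}

# Hypothetical sample output file contents.
printf '%s\n' \
    'some header' \
    'ATOMIC COORDINATES' \
    'C 6.0 0.000 0.000 0.000' \
    'H 1.0 0.000 0.000 2.000' \
    'Bond lengths in Bohr' \
    'C-H 2.000' | extract_geom
```

In a script that takes the file name as its first argument, the call would be `extract_geom "$1" >> "${1}_geom"`; the `>>` redirection keeps the line breaks exactly as awk prints them.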

sed delete unmatched lines between two lines with bash variable

I need help understanding a weird problem with sed, bash and a while loop.
My data looks like this:
-File 1- CSV
account,hostnames,status,ipaddress,port,user,pass
-File 2- XML - This is a sample record set for two entries under one account
<accountname="account">
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
</accountname>
So far, I can add records from File1 to File2 in between the respective account holder's tags. But if I need to remove records that no longer exist, it does not work properly, since it wipes other records from different accounts, i.e. it does not delete only between a matched accountname.
I import from File 1 into File 2 with a while loop in my bash program:
-Bash Program excerpts-
//Read File in to F//
cat File 2 | while read F
do
//extract fields from F into variables
_vmname="$(echo $F |grep 'cname'| sed 's/<cname="//g' |sed 's/.\{2\}$//g')"
_account="$(echo $F | grep 'accountname' | sed 's/accountname="//g' |sed 's/.\{2\}$//g')"
// I then compare my File1 and look for stale records that are still in File2
if grep "$_vmname" File1 ;then
continue
else
// if not matched, delete between the respective accountname
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
If I manually declare _vmname and _account and run
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
It removes the stale records from File2. When I let my bash script run, it does not.
I think I have three problems:
Reading the variables for _vmname and _account name inside a loop makes it read numerous times. Any better way to do is appreciated.
I do not think the sed statement for matching these two patterns and then delete works like I want inside a while loop.
I may have a logic problem with my thought chain.
Any pointers, and please no awk, perl, lxml or python for this one.
Thanks!
"and please no awk"
I appreciate that you want to keep things simple, and I suppose awk seems more complicated than what you're doing. But I'd like to point out you have so far 3 grep and 4 sed invocations per line in the file, to process another file N times, once per line. That's O(mn) using the slowest method on the planet to read the file (a while loop). And it doesn't work.
"I may have a logic problem with my thought chain."
I'm afraid we must allow for that possibility!
The right advice is to tackle XML with an XML parser, because XML is not a regular language and so can't be parsed with regular expressions. And that's really what you need here, because your program processes the whole XML document. You're not just plucking out bits and depending on incidental formatting artifacts; you want to add records that aren't there and remove those that "no longer exist". Apparently there is information in the XML document you need to preserve, else you would just produce it from the CSV. A parser would spoon-feed it to you.
The second-best advice is to use awk. I suppose you might try an approach like:
1. Process the CSV and produce the XML to be inserted.
2. In awk, first read the new input XML into an array keyed by cname. Then process the XML target once. For every cname, consult your array; if you find a match, insert your pre-constructed XML replacement (or modify the "paragraph" accordingly).
3. I'm not sure what the delete criteria are, so I don't know if it can be done in the same pass with step #2. If not, extract the salient information somehow. Maybe print a list of keys from each of the two files, and use comm(1) to produce a list of to-be-deleted. Then, similar to step #2, read in that list, and process the XML file one more time. Write anything you delete to stderr so you can keep track of what went missing, from what lines.
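The delete step could be sketched roughly as follows. This is only an illustration under assumptions that the question does not confirm: a record counts as stale when its cname does not appear in the second CSV column (hostnames), and stale records are removed regardless of which account they sit under. All function names and the sample data are hypothetical. Process substitution needs bash.

```shell
#!/usr/bin/env bash

# Keys from each file, sorted so that comm can compare them.
csv_keys() { cut -d, -f2 "$1" | sort -u; }
xml_keys() { sed -n 's/.*<cname="\([^"]*\)">.*/\1/p' "$1" | sort -u; }

# comm -13 keeps lines found only in its second input:
# names present in the XML but absent from the CSV.
stale_keys() { comm -13 <(csv_keys "$1") <(xml_keys "$2"); }

# remove_stale STALE_LIST XML_FILE: one pass over the XML, dropping every
# <cname="..."> ... </cname> block whose name is listed in STALE_LIST.
remove_stale() {
    awk '
        FILENAME == ARGV[1] { stale[$0] = 1; next }   # load the stale names
        match($0, /<cname="[^"]*">/) {
            name = substr($0, RSTART + 8, RLENGTH - 10)
            if (name in stale) {
                skipping = 1
                print "deleted: " name > "/dev/stderr"
            }
        }
        !skipping { print }
        skipping && /<\/cname>/ { skipping = 0 }      # end of a dropped block
    ' "$1" "$2"
}

# Hypothetical sample data in the layout from the question.
work=$(mktemp -d)
printf '%s\n' 'acct1,host-a,up,10.0.0.1,22,user,pass' > "$work/File1"
cat > "$work/File2" <<'EOF'
<accountname="acct1">
<cname="host-a">
<field="port">22</field>
</cname>
<cname="host-b">
<field="port">22</field>
</cname>
</accountname>
EOF

stale_keys "$work/File1" "$work/File2" > "$work/stale"
remove_stale "$work/stale" "$work/File2"   # host-b block is dropped
rm -rf "$work"
```

Each file is read once, so the cost no longer grows with the product of the two file sizes. Restricting the deletion to one account would need an extra check on the enclosing accountname line.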
"Any pointers"
Whenever you find yourself processing the same file N times for N inputs, you know you're headed for trouble. One of the two inputs is always smaller, and that one can be put in some kind of array. cat file | while read is another warning signal, telling you to use awk or any of a dozen obvious utilities that understand lines of text.
You posted your question on SO two weeks ago. I suspect no one answered it because you warned them away: preemptively saying, in effect, don't tell me to use good tools. I'm only here to suggest that you'll be more comfortable after you take off that straightjacket. Better tools, in this case, are the only right answer.

Make an array of links contained in one big string

I have a big string (the HTML code of a web page).
Now the problem is how to parse out the links to images.
I want to make an array of all the links to images in that web page.
I know how to do this in Java, but I do not know how to parse strings and do string manipulation in shell. I know there are many tricks and I guess this can easily be done.
In the end I want to get something like this:
#!/bin/bash
# capture the whole page; "read" would only keep the first line
BIG_STRING=$(curl -s some_web_page_with_links_to_images.com)
# parse the big string and fill the LINKS variable
# with the links to images somehow (.jpg and .png only)
# after the parsing, LINKS should look like this
LINKS=("www.asd.com/asd1.jpg" "www.asd.com/asd.jpg" "www.asd.com/asd2123.jpg")
# I need the parsing and to fill the LINKS variable with the links from the web page
# get length of an array
tLen=${#LINKS[@]}
for (( i=0; i<tLen; i++ ));
do
echo "${LINKS[$i]}"
done
Thanks, for the responses you saved me days of frustrations
Why not start with the right tool? Parsing HTML is hard, especially with sed. If you have the mojo tool from the Mojolicious project you can do this:
mojo get http://example.com a attr href
And then just check whether each line ends with jpg, png, or whatever.
It's hard to offer more than approximations. Let's assume all interesting links are href="" attributes, and there's at most one href attribute per line (and that each link fits on one line; I'm not actually sure whether newlines are allowed inside URLs).
Let's assume your sourcefile is called test.html.
The following should print all links under these assumptions:
sed -n 's/.*\<href="\([^"]*\)".*/\1/p' test.html
To understand how this works, you should know what regular expressions are and have read a tutorial on sed (particularly how the s (substitute) command works).
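To get from such a command to the LINKS array the question asks for, one option is to let grep print each matching attribute on its own line and read those lines into an array with mapfile. This is only a sketch under assumptions: bash 4 or later (for mapfile), image links that appear in double-quoted src or href attributes, and a made-up BIG_STRING standing in for the curl output.

```shell
#!/usr/bin/env bash

# Hypothetical HTML standing in for the output of curl.
BIG_STRING='<img src="www.asd.com/asd1.jpg"><a href="www.asd.com/page.html">x</a>
<img src="www.asd.com/asd2123.png"> <img src="www.asd.com/anim.gif">'

# grep -o prints every match on its own line, so several links on one
# line are fine; -i ignores case, -E enables the (jpg|png) alternation.
# sed then strips the attribute name and the surrounding quotes.
mapfile -t LINKS < <(
    printf '%s\n' "$BIG_STRING" |
        grep -oiE '(src|href)="[^"]+\.(jpg|png)"' |
        sed -E 's/^[^"]*"//; s/"$//'
)

for link in "${LINKS[@]}"; do
    printf '%s\n' "$link"
done
```

For the sample string this leaves two entries in LINKS, www.asd.com/asd1.jpg and www.asd.com/asd2123.png; the .html and .gif links are filtered out.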

Resources