processing xml files with bash scripting

processing xml files with bash scripting - bash

I have an xml file which has the following structure that contains numerous <Episodes></Episodes> to which the structure looks like this:
<Episode>
<id>4195462</id>
<Combined_episodenumber>8</Combined_episodenumber>
<Combined_season>2</Combined_season>
<DVD_chapter></DVD_chapter>
<DVD_discid></DVD_discid>
<DVD_episodenumber></DVD_episodenumber>
<DVD_season></DVD_season>
<Director>Jay Karas</Director>
<EpImgFlag>2</EpImgFlag>
<EpisodeName>Karl's Wedding</EpisodeName>
<EpisodeNumber>8</EpisodeNumber>
<FirstAired>2011-11-08</FirstAired>
<GuestStars>Katee Sackhoff|Carla Gallo</GuestStars>
<IMDB_ID></IMDB_ID>
<Language>en</Language>
<Overview>Karl Hevacheck, aka the Human Genius, gets married.</Overview>
<ProductionCode>209</ProductionCode>
<Rating>7.6</Rating>
<RatingCount>20</RatingCount>
<SeasonNumber>2</SeasonNumber>
<Writer>Kevin Etten</Writer>
<absolute_number></absolute_number>
<filename>episodes/211751/4195462.jpg</filename>
<lastupdated>1362547148</lastupdated>
<seasonid>471254</seasonid>
<seriesid>211751</seriesid>
</Episode>
I've figured out how to pull the information between a single tag like so
value=$(grep -m 1 "<Rating>" path_to_file | sed 's/<.*>\(.*\)<\/.*>/\1/')
but I can't find a way to verify that I am looking at the correct episode ie. to check If this is the correct branch which is for <Combined_season>2</Combined_season> <EpisodeNumber>8</EpisodeNumber> before saving the values for specific attributes. I know this can somehow be done using a combination of sed and awk but can't seem to figure it out anyhelp on how I can do this would be greatly appreciated.

Use a proper XML parser not sed or awk. You can still call your XML parser from your bash script just like you would with sed or awk. It's a bad idea to use sed or awk because XML is a structured file, sed and awk typical work with line oriented files. You will just give yourself a headache by using the wrong tool for the job. I suggest using a dedicated tools or a language such a php, python or perl (or any other language not starting with p) that has libraries for parsing XML.

Related

Linux sed command that generates a new file on every regex match

I have the following Linux command which I am using to extract data from one very large log file.
sed -n "/<trade>/,/<\/trade>/p" Large.log > output.xml
However, the output is generated in a single file output.xml. My intention is to create a new file every time the "/<trade>/,/<\/trade>/p" is matched. Every new file will be named after the <id> tag which is inside the <trade> </trade> tags.
Something likes this...
sed -n "/<trade>/,/<\/trade>/p" Large.log > "/<id>/,/<\/id>/p".xml
However, that, of course, does not work and I am not sure how to apply a regex as a naming rule.
P.S At this point, I am also not sure if I should use sed or maybe I should try achieving this with awk

Issue with bash script using SED/AWK for substituion

I have been working on this little script at work to free up my own time and am currently stuck on part of it. The script is supposed to pull some content from a JSON, modify the content, and then re-upload it. The modification part is the portion that doesn't work.
An example of what the content looks like after being extracted from the JSON is:
<p>App1_v1.0_20160911_release.apk</p<p>App2_v2.0_20160915_beta.apk</p><p>App3_v3.0_20150909_VendorRelease.apk</p>
The modification function is supposed to update the list with the newer app filenames in the same location. I've tried using both SED and AWK to get this to work but I haven't gotten anywhere fast.
Here are examples of both commands and the parameters for the substitution I am trying to run on the example file:
old_name=App1_.*_release.apk
new_name=App1_v1.0_20160920_1152_release.apk
sed "s/$old_name/$new_name/" body > upload
awk -v oldname="$old_name" -v newname="$new_name" '{sub(oldname, newname)}1' body > upload
What ends up happening is the substitution will change the correct part of the list, but then nuke everything between that point and the end of the list.
Thank you for any and all help.
PS: If I didn't explain something correctly or you feel some information is missing, please comment and let me know so I can better explain the problem.

There are SO many possible values of oldname, newname, and your input data that could cause either of the commands you wrote to fail - don't use that "replace a regexp with a backreference-enabled-string" approach in any command, use string operations instead (which means you can't use sed since sed doesn't support strings)
This modifies your sample input as you say you want:
$ awk -v new='App1_v1.0_20160920_1152_release.apk' 'BEGIN{RS="</p>\n?"; FS=OFS="<p>"} NR==1{$2=new} {printf "%s%s", $0, RT}' file
<p>App1_v1.0_20160920_1152_release.apk<p>App2_v2.0_20160915_beta.apk</p><p>App3_v3.0_20150909_VendorRelease.apk</p>
If that's not adequate then edit your question to better explain your requirements and provide more truly representative sample input/output.
The above uses GNU awk for multi-char RS and RT.

Grep every word from a file starting a pattern

So I have a file let's call "page.html". Within this file, there's some links/file paths I want to extract. I've been working in BASH trying to get this right but can't seem to do it. The words/links/paths I want to grab all start with "/funny/hello/there/". The goal is for all these words to go to the terminal so I can use them.
This is kinda what I've tried so far, with no luck:
grep -E '^/funny/hello/there/` page.html
and
grep -Po '/funny/hello/there/.*?` page.html
Any help would be greatly appreciated, Thanks.
Here is sample data from the file:
`<td data-title="Blah" class="Blah" >
fdsksldjfah
</td>`
My output gives me all the different line that look like this:
fdsksldjfah
The "/fkljaskdjfl" are all something different though.
What I want the output to look like:
/funny/hello/there/fkljaskdjfl
/funny/hello/there/kfjasdflas
/funny/hello/there/kdfhakjasa

You can use this grep command:
grep -o "/funny/hello/there/[^'\"[:blank:]]*" page.html
However one should avid parsing HTML using shell utilities and use dedicated HTML dom parsers instead.

Find and replace in file with script

I want to find and replace the VALUE into a xml file :
<test name="NAME" value="VALUE"/>
I have to filter by name (because there are lot of lines like that).
Is it possible ?
Thanks for you help.

Since you tagged the question "bash", I assume that you're not trying to use an XML library (although I think an XML expert might be able to give you something like an XSLT processor command that solves this question very robustly), but that you're simply interested in doing search & replace from the commandline.
I am using perl for this:
perl -pi -e 's#VALUE#replacement#g' *.xml
See perlrun man page: Very shortly put, the -p switches perl into text processing mode, -i stands for "in-place", and -e let's you specify an expression to apply to all lines of input.
Also note (if you are not too familiar with that already) that you may use other characters than # (common ones are %, a comma, etc.) that don't clash with your search & replacement strings.
There is one small caveat: perl will read & write all files given on the commandline, even those that did not change. Thus, the files' modification times will be updated even if they did not change. (I usually work around that with some more shell magic, e.g. using grep -l or grin -l to select files for perl to work on.)
EDIT: If I understand your comments correctly, you also need help with the regular expression to apply. Let me briefly suggest something like this then:
perl -pi -e 's,(name="NAME" value=)"[^"]*",\1"NEWVALUE",g' *.xml

Related: bash XHTML parsing using xpath
You can use SED:
SED 's/\(<test name=\"NAME\"\) value=\"VALUE\"/\1 value=\"YourValue\"/' test.xml
where test.xml is the xml document containing the given node. This is very fragile, and you can work to make it more flexible if you need to do this substitution multiple times. For instance, the current statement is case sensitive, so it won't substitute the value on a node with the name="name", but you can add a case insensitivity flag to the end of the statement, like so:
('s/\(<test name=\"NAME\"\) value=\"VALUE\"/\1 value=\"YourValue\"/I').
Another option would be to use XSLT, but it would require you to download an external library. It's pretty versatile, and could be a viable option for more complex modifications to an XML document.

Replace last line of XML file

Looking for help creating a script that will replace the last line of an XML file with a tag. I have a few hundred files so I'm looking for something that will process them in a loop. I've managed to rename the files sequentially like this:
posts1.xml
posts2.xml
posts3.xml
etc...
to make it easier to loop through. But I have no idea how to write a script to do this. I'm open to using either Linux or Windows (but i would guess that Linux is better for this kind of task).

So if you want to append a line to every file:
sed -i '$a<YOUR_SHINY_NEW_TAG>' *xml
To replace the last line:
sed -i '$s/.*/<YOUR_SHINY_NEW_TAG>/' *xml
But do note, sed is not the ideal tool to modify xml.

XMLStarlet is a command-line toolkit for performing XML parsing and manipulations. Note that as an XML-aware toolkit, it'll respect XML structure, character encoding and entity substitution.
Check out the ed command to see how to modify documents. You can wrap this in a standard bash loop.
e.g. in a doc consisting of a chain of <elem>s, you can add a following <added>5</added>:
mkdir new
for x in *.xml; do
xmlstarlet ed -a "//elem[count(//elem)]" -t elem -n added -v 5 $x > new/$x
done

Linux way using sed:
To edit the last line of the file in place, you can use sed:
sed -i '$s_pattern_replacement_' filename
To change the whole line to "replacement" use $s_.*_replacement_. Be sure to escape any _'s in replacement with a \.
To loop over files, just use for:
for f in /path/posts*.xml; do sed -i '$s_.*_replacement_' $f; done
This, however, is a dirty way as it's not aware of the XML structure, whereas the XML structure is not affected by newlines. You have to be sure the last line of the files contains exactly what you expect it to.

It makes little to no difference whether you're on Linux, Windows or MacOS
The question is what language do you want to use?
The following is an example in c# (not optimized, but read it as speudocode):
string rootDirectory = #"c:\myfiles";
var files = Directory.GetFiles(rootDirectory, "*.xml");
foreach (var file in files)
{
var lines = File.ReadAllLines(file);
lines[lines.Length - 1] = "whatever you want here";
File.WriteAllLines(file, lines);
}
You can compile this and run it on Windows, Linux, etc..
Or you could do the same in Python.
Of course this method does not actually parse the XML,
but you just wanted to replace the last line right?

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

processing xml files with bash scripting - bash

Related

Linux sed command that generates a new file on every regex match

Issue with bash script using SED/AWK for substituion

Grep every word from a file starting a pattern

Find and replace in file with script

Replace last line of XML file

Categories

Resources