Faster textual data processing in BASH - bash

I've got a speed question. I have a bash script which parses information from TheTvDb.com. It downloads nearly 40,000 lines of data, then reduces that to about 5,000 lines, which get written to the hard disk. Then it reads the file and parses it into several files which are used later as lookup tables. It basically collects everything it sees before each "/Episode" tag, writes it to a specific file, then resets for the next episode.
It has to synchronize on the "/Episode" tag because there is a "FirstAired" tag outside of the episode tags. This ensures that the data is read in sequence rather than relying on each individual tag relating to an episode.
Here is the code in question.
if [ -f "$mythicalLibrarian/$NewShowName/$NewShowName.xml" ]; then
Ename=""
actualEname=""
FAired=""
SeasonNr=""
EpisodeNr=""
recordNumber=0
echo "Parsing Downloaded information: $NewShowName.xml "
while read line
do
if [[ $line == \<\/Episode\> ]]; then
(( ++recordNumber ))
echo -ne "Building Record:$recordNumber ${actualEname:0:20} \r" 1>&2
echo "$actualEname" >> "$mythicalLibrarian/$NewShowName/$NewShowName.actualEname.txt"&
Ename=`echo "$actualEname" |sed 's/;.*//'`
echo "$Ename" >> "$mythicalLibrarian/$NewShowName/$NewShowName.Ename.txt"&
echo "$FAired" >> "$mythicalLibrarian/$NewShowName/$NewShowName.FAired.txt"&
echo "$SeasonNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.S.txt"&
echo "$EpisodeNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.E.txt"&
Ename=""
actualEname=""
FAired=""
SeasonNr=""
EpisodeNr=""
#Get actual show name
elif [[ $line == \<EpisodeName\>* ]]; then
actualEname=`echo "$line" | sed -e s/'<\/EpisodeName>'// -e s/'<EpisodeName>'// -e s/'\&amp\;'/'\&'/ -e s/'\&quot\;'/'\"'/ -e s/'\&amp\;'/'\&'/ -e s/'\&ndash\;'/'-'/ -e s/'\&lt\;'/'\<'/ -e 's/'\&gt\;'/'\>'/' |tr -d '|\?\*\<\"\:\>\+\\\[\]\/'`
#Get OriginalAirDate
elif [[ $line == \<FirstAired\>* ]]; then
FAired=`echo "$line" | sed -e s/'<FirstAired>'//g -e s/'<\/FirstAired>'//g`
#Get Season number
elif [[ $line == \<SeasonNumber\>* ]]; then
SeasonNr=`echo "$line" |sed -e s/'<SeasonNumber>'// -e s/'<\/SeasonNumber>'//`
#Get Episode number
elif [[ $line == \<EpisodeNumber\>* ]]; then
EpisodeNr=`echo "$line" |sed -e 's/<EpisodeNumber>//' -e 's/<\/EpisodeNumber>//'`
fi
done < "$mythicalLibrarian/$NewShowName/$NewShowName.xml"
chmod 777 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".actualEname.txt
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".Ename.txt
chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".FAired.txt
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".S.txt
chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".E.txt
GotNewInformation=1
elif [ ! -f "$mythicalLibrarian/$NewShowName/$NewShowName.xml" ]; then
echo "COULD NOT DOWNLOAD:www.thetvdb.com/api/$APIkey/series/$SeriesID/all/$Language.xml">>"$mythicalLibrarian"/output.log
fi
Here is some of the data it is processing
<?xml version="1.0" encoding="UTF-8" ?>
<Data><Series>
<Actors>|Fred Rogers|Adair Roth|Bert Lloyd|Bud Alder|Carol Saunders|Carole Switala|Deborah Neal Stampo|Don Brockett|Elsie Neal|Emilie Jacobson|Fred Michael|John Reardon|Jos|Judy Rubin|Keith David|Lenny Meledandri|Michael Horton|Robert Trow|Yoshi Ito|</Actors>
<Airs_DayOfWeek></Airs_DayOfWeek>
<Airs_Time></Airs_Time>
<ContentRating></ContentRating>
<FirstAired>1968-02-01</FirstAired>
<Genre>|Children|</Genre>
<Network>PBS</Network>
<NetworkID></NetworkID>
<Overview>"In a little toy neighborhood, a tiny trolley rolls past a house at the end of a street.
<Runtime>30</Runtime>
<SeriesID>6843</SeriesID>
<SeriesName>Mister Rogers' Neighborhood</SeriesName>
<Status>Ended</Status>
<added></added>
<addedBy></addedBy>
<banner>graphical/77750-g.jpg</banner>
<fanart>fanart/original/77750-1.jpg</fanart>
<poster></poster>
<zap2it_id>SH002930</zap2it_id>
</Series>
<Episode>
<EpisodeName>Change (1)</EpisodeName>
<EpisodeNumber>1</EpisodeNumber>
<FirstAired>1968-02-19</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Change (2)</EpisodeName>
<EpisodeNumber>2</EpisodeNumber>
<FirstAired>1968-02-20</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Change (3)</EpisodeName>
<EpisodeNumber>3</EpisodeNumber>
<FirstAired>1968-02-21</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Change (4)</EpisodeName>
<EpisodeNumber>4</EpisodeNumber>
<FirstAired>1968-02-22</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Change (5)</EpisodeName>
<EpisodeNumber>5</EpisodeNumber>
<FirstAired>1968-02-23</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 6</EpisodeName>
<EpisodeNumber>6</EpisodeNumber>
<FirstAired>1968-02-26</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 7</EpisodeName>
<EpisodeNumber>7</EpisodeNumber>
<FirstAired>1968-02-27</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 8</EpisodeName>
<EpisodeNumber>8</EpisodeNumber>
<FirstAired>1968-02-28</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 9</EpisodeName>
<EpisodeNumber>9</EpisodeNumber>
<FirstAired>1968-02-29</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 10</EpisodeName>
<EpisodeNumber>10</EpisodeNumber>
<FirstAired>1968-03-01</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 11</EpisodeName>
<EpisodeNumber>11</EpisodeNumber>
<FirstAired>1968-03-04</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 12</EpisodeName>
<EpisodeNumber>12</EpisodeNumber>
<FirstAired>1968-03-05</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 13</EpisodeName>
<EpisodeNumber>13</EpisodeNumber>
<FirstAired>1968-03-06</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 14</EpisodeName>
<EpisodeNumber>14</EpisodeNumber>
<FirstAired>1968-03-07</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 15</EpisodeName>
<EpisodeNumber>15</EpisodeNumber>
<FirstAired>1968-03-08</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Welcome Donkey Hodie (1)</EpisodeName>
<EpisodeNumber>16</EpisodeNumber>
<FirstAired>1968-03-11</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Welcome Donkey Hodie (2)</EpisodeName>
<EpisodeNumber>17</EpisodeNumber>
<FirstAired>1968-03-12</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Welcome Donkey Hodie (3)</EpisodeName>
<EpisodeNumber>18</EpisodeNumber>
<FirstAired>1968-03-13</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Welcome Donkey Hodie (4)</EpisodeName>
<EpisodeNumber>19</EpisodeNumber>
<FirstAired>1968-03-14</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Welcome Donkey Hodie (5)</EpisodeName>
<EpisodeNumber>20</EpisodeNumber>
<FirstAired>1968-03-15</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 21</EpisodeName>
<EpisodeNumber>21</EpisodeNumber>
<FirstAired>1968-03-18</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 22</EpisodeName>
<EpisodeNumber>22</EpisodeNumber>
<FirstAired>1968-03-19</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 23</EpisodeName>
<EpisodeNumber>23</EpisodeNumber>
<FirstAired>1968-03-20</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 24</EpisodeName>
<EpisodeNumber>24</EpisodeNumber>
<FirstAired>1968-03-21</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 25</EpisodeName>
<EpisodeNumber>25</EpisodeNumber>
<FirstAired>1968-03-22</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 26</EpisodeName>
<EpisodeNumber>26</EpisodeNumber>
<FirstAired>1968-03-25</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 27</EpisodeName>
<EpisodeNumber>27</EpisodeNumber>
<FirstAired>1968-03-26</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 28</EpisodeName>
<EpisodeNumber>28</EpisodeNumber>
<FirstAired>1968-03-27</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 29</EpisodeName>
<EpisodeNumber>29</EpisodeNumber>
<FirstAired>1968-03-28</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 30</EpisodeName>
<EpisodeNumber>30</EpisodeNumber>
<FirstAired>1968-03-29</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Red Monster (1)</EpisodeName>
<EpisodeNumber>31</EpisodeNumber>
<FirstAired>1968-04-01</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Red Monster (2)</EpisodeName>
<EpisodeNumber>32</EpisodeNumber>
<FirstAired>1968-04-02</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Red Monster (3)</EpisodeName>
<EpisodeNumber>33</EpisodeNumber>
<FirstAired>1968-04-03</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Red Monster (4)</EpisodeName>
<EpisodeNumber>34</EpisodeNumber>
<FirstAired>1968-04-04</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Red Monster (5)</EpisodeName>
<EpisodeNumber>35</EpisodeNumber>
<FirstAired>1968-04-05</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 36</EpisodeName>
<EpisodeNumber>36</EpisodeNumber>
<FirstAired>1968-04-08</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 37</EpisodeName>
<EpisodeNumber>37</EpisodeNumber>
<FirstAired>1968-04-09</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 38</EpisodeName>
<EpisodeNumber>38</EpisodeNumber>
<FirstAired>1968-04-10</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 39</EpisodeName>
<EpisodeNumber>39</EpisodeNumber>
<FirstAired>1968-04-11</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 40</EpisodeName>
<EpisodeNumber>40</EpisodeNumber>
<FirstAired>1968-04-12</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 41</EpisodeName>
<EpisodeNumber>41</EpisodeNumber>
<FirstAired>1968-04-15</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 42</EpisodeName>
<EpisodeNumber>42</EpisodeNumber>
<FirstAired>1968-04-16</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 43</EpisodeName>
<EpisodeNumber>43</EpisodeNumber>
<FirstAired>1968-04-17</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 44</EpisodeName>
<EpisodeNumber>44</EpisodeNumber>
<FirstAired>1968-04-18</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 45</EpisodeName>
<EpisodeNumber>45</EpisodeNumber>
<FirstAired>1968-04-19</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 46</EpisodeName>
<EpisodeNumber>46</EpisodeNumber>
<FirstAired>1968-04-22</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 47</EpisodeName>
<EpisodeNumber>47</EpisodeNumber>
<FirstAired>1968-04-23</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 48</EpisodeName>
<EpisodeNumber>48</EpisodeNumber>
<FirstAired>1968-04-24</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 49</EpisodeName>
<EpisodeNumber>49</EpisodeNumber>
<FirstAired>1968-04-25</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 50</EpisodeName>
<EpisodeNumber>50</EpisodeNumber>
<FirstAired>1968-04-26</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 51</EpisodeName>
<EpisodeNumber>51</EpisodeNumber>
<FirstAired>1968-04-29</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 52</EpisodeName>
<EpisodeNumber>52</EpisodeNumber>
<FirstAired>1968-04-30</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 53</EpisodeName>
<EpisodeNumber>53</EpisodeNumber>
<FirstAired>1968-05-01</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 54</EpisodeName>
<EpisodeNumber>54</EpisodeNumber>
<FirstAired>1968-05-02</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 55</EpisodeName>
<EpisodeNumber>55</EpisodeNumber>
<FirstAired>1968-05-03</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 56</EpisodeName>
<EpisodeNumber>56</EpisodeNumber>
<FirstAired>1968-05-06</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 57</EpisodeName>
<EpisodeNumber>57</EpisodeNumber>
<FirstAired>1968-05-07</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 58</EpisodeName>
<EpisodeNumber>58</EpisodeNumber>
<FirstAired>1968-05-08</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 59</EpisodeName>
<EpisodeNumber>59</EpisodeNumber>
<FirstAired>1968-05-09</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 60</EpisodeName>
<EpisodeNumber>60</EpisodeNumber>
<FirstAired>1968-05-10</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 61</EpisodeName>
<EpisodeNumber>61</EpisodeNumber>
<FirstAired>1968-05-13</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 62</EpisodeName>
<EpisodeNumber>62</EpisodeNumber>
<FirstAired>1968-05-14</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 63</EpisodeName>
<EpisodeNumber>63</EpisodeNumber>
<FirstAired>1968-05-15</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 64</EpisodeName>
<EpisodeNumber>64</EpisodeNumber>
<FirstAired>1968-05-16</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 65</EpisodeName>
<EpisodeNumber>65</EpisodeNumber>
<FirstAired>1968-05-17</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 66</EpisodeName>
<EpisodeNumber>66</EpisodeNumber>
<FirstAired>1968-05-20</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 67</EpisodeName>
<EpisodeNumber>67</EpisodeNumber>
<FirstAired>1968-05-21</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 68</EpisodeName>
<EpisodeNumber>68</EpisodeNumber>
<FirstAired>1968-05-22</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 69</EpisodeName>
<EpisodeNumber>69</EpisodeNumber>
<FirstAired>1968-05-23</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 70</EpisodeName>
<EpisodeNumber>70</EpisodeNumber>
<FirstAired>1968-05-24</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 71</EpisodeName>
<EpisodeNumber>71</EpisodeNumber>
<FirstAired>1968-05-27</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
<Episode>
<EpisodeName>Show 72</EpisodeName>
<EpisodeNumber>72</EpisodeNumber>
<FirstAired>1968-05-28</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
The problem is that on an i7 processor this takes 14.5 seconds. It is about 10x slower on my media center. I tried using a case statement instead, which takes 15 seconds on the fast processor.
I would like to know how to speed this process up. This seems ridiculously slow for BASH, which is supposed to be designed around data manipulation and file operations.

You will get a considerable speedup by dropping the & from the end of all those echo statements.
Test1:
$ time { for i in {1..1000}; do echo "hello"& done >/dev/null; } | cat
real 0m10.357s
user 0m2.764s
sys 0m15.441s
The cat eats the "done" messages when this is done at the command line. A colon could be used instead of cat to suppress the "done" messages from the first timed test. It's not the program that's doing it, it's the fact that the backgrounded processes are part of a pipe.
Test2:
$ time { for i in {1..1000}; do echo "hello"; done >/dev/null; }
real 0m0.152s
user 0m0.132s
sys 0m0.020s
Note that this was on a very slow, old machine.
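A further step in the same direction, not measured here, is to open each output file once and write through file descriptors instead of reopening the files for every record. This is a minimal sketch; the temporary directory and the show.* file names are placeholders standing in for the paths in the question.

```shell
#!/bin/bash
# Sketch: open each output file once, then write through the descriptors.
outdir=$(mktemp -d)                 # stand-in for "$mythicalLibrarian/$NewShowName"
exec 3>>"$outdir/show.Ename.txt" 4>>"$outdir/show.FAired.txt"

for i in 1 2 3; do
    echo "Episode $i"  >&3          # no open/close per record, no fork
    echo "1968-02-2$i" >&4
done

exec 3>&- 4>&-                      # close the descriptors when finished
wc -l < "$outdir/show.Ename.txt"
```

Whether this makes a visible difference on top of removing the & is something to time on your own data.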
You may also get a speed improvement by using Bash's regex and string processing features instead of repeatedly spawning multiple external utilities in a loop.
Example:
elif [[ $line == \<EpisodeName\>* ]]; then
actualEname=${line//<\/EpisodeName>/}
actualEname=${actualEname//<EpisodeName>/}
actualEname=${actualEname//&amp;/\&}  # \& is a literal ampersand; bash 5.2+ expands a bare & to the match
actualEname=${actualEname//&ndash;/-}
for string in '|' '<' '>' '"' '?' '*' '<' '>' ':' '"' '+' '\' '[' ']' '/'
do
actualEname=${actualEname//$string}
done
You had an extra &amp; substitution in that line and a lot of unnecessary single quotes and escaping, by the way. Also, you're converting HTML entities to characters and then deleting those characters. Why not just delete the entities to begin with? You also seem to be missing some g (global) modifiers.
Test3:
$ time { for i in {1..100}; do
line='<EpisodeName>&lt;foo&amp;bar&ndash;baz&gt;Season&ndash;3&ndash;&quot;quux&quot;?*<>:"+\[]/</EpisodeName>'
actualEname=$(echo "$line" | sed -e 's/<\/EpisodeName>//' -e 's/<EpisodeName>//' -e 's/&amp;/\&/g' -e 's/&quot;/"/g' -e 's/&ndash;/-/g' -e 's/&lt;/</g' -e 's/&gt;/>/g' |tr -d '|?*<":>+\\[]/')
done; }
real 0m7.779s
user 0m3.164s
sys 0m5.436s
Test4:
$ time { for i in {1..100}; do
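line='<EpisodeName>&lt;foo&amp;bar&ndash;baz&gt;Season&ndash;3&ndash;&quot;quux&quot;?*<>:"+\[]/</EpisodeName>'
actualEname=${line//<\/EpisodeName>/}
actualEname=${actualEname//<EpisodeName>/}
actualEname=${actualEname//&amp;/\&}  # \& is a literal ampersand; bash 5.2+ expands a bare & to the match
actualEname=${actualEname//&ndash;/-}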
for string in '|' '<' '>' '"' '\?' '\*' '<' '>' ':' '"' '+' '\\' '[' ']' '\/'
do
actualEname=${actualEname//$string}
done
done; }
real 0m5.403s
user 0m2.492s
sys 0m2.960s
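The same idea can be folded into small helpers so that each tag needs one call and no subshell. This is a minimal sketch, assuming each element sits on its own line with no attributes, as in the sample data; tag_value and decode_entities are hypothetical names, not part of the original script, and they hand back their result in REPLY to avoid command substitution.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative only.

# Put the text between <Tag> and </Tag> into REPLY.
# Assumes one element per line, no attributes, no nesting.
tag_value() {
    local line=$1
    line=${line#*>}      # drop everything up to the first '>'
    REPLY=${line%<*}     # drop everything from the last '<'
}

# Decode the handful of entities the question's sed chain handled.
decode_entities() {
    local s=$1
    s=${s//&lt;/<}
    s=${s//&gt;/>}
    s=${s//&quot;/\"}
    s=${s//&ndash;/-}
    s=${s//&amp;/\&}     # last, so "&amp;lt;" is not decoded twice
    REPLY=$s
}

tag_value '<EpisodeName>Tom &amp; Jerry &quot;Pilot&quot;</EpisodeName>'
decode_entities "$REPLY"
echo "$REPLY"
```

Inside the read loop this would replace each sed pipeline with two function calls followed by an assignment from REPLY.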

Use something like XMLStarlet which is designed to process XML.
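For instance, assuming the xmlstarlet binary is installed (some distributions install it as xml) and the file has the Data/Episode layout shown in the question, a single call can emit one line per episode. This is a sketch against a tiny made-up sample, not the real TheTvDb feed; the pipe separator is an arbitrary choice.

```shell
#!/bin/bash
# Build a tiny sample so the sketch is self-contained.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<Data>
<Episode>
<EpisodeName>Change (1)</EpisodeName>
<EpisodeNumber>1</EpisodeNumber>
<FirstAired>1968-02-19</FirstAired>
<SeasonNumber>1</SeasonNumber>
</Episode>
</Data>
EOF

# -T: plain-text output; -m: loop over each Episode; -v: print a value;
# -o: print a literal separator; -n: newline after each match.
if command -v xmlstarlet >/dev/null 2>&1; then
    episodes=$(xmlstarlet sel -T -t -m '/Data/Episode' \
        -v 'SeasonNumber' -o '|' -v 'EpisodeNumber' -o '|' \
        -v 'FirstAired' -o '|' -v 'EpisodeName' -n "$xml")
    echo "$episodes"
else
    echo "xmlstarlet not found; install it to run this sketch" >&2
fi
rm -f "$xml"
```

Because the XPath starts at /Data/Episode, the series-level FirstAired tag is never picked up, so no manual synchronization on the closing tag is needed.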

The slowdown is most likely due to the very high number of process spawns that are happening in that script (sed, tr).
You could achieve a much faster result by calling a program with an XML parser to read it in, and output to the various files. If you need to keep it in bash, maybe find something that can do XSLT to transform from the XML to the format used in the files and divide it up.
Personally I would do that sort of thing in Perl.
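Short of a real XML parser, a single awk process can do the same line-oriented split without spawning anything per line. Note that awk is not an XML parser: this sketch assumes one element per line, as in the sample data, and it leaves out the entity decoding and character stripping. split_episodes and its output prefix argument are hypothetical names.

```shell
#!/bin/bash
# Line-oriented sketch: relies on one element per line; not a real XML parser.
split_episodes() {   # usage: split_episodes input.xml output_prefix
    awk -v prefix="$2" '
        # Return the text between <tag> and </tag> on the current line.
        function value(tag,   s) {
            s = $0
            sub("^[ \t]*<" tag ">", "", s)
            sub("</" tag ">[ \t\r]*$", "", s)
            return s
        }
        /<EpisodeName>/   { name   = value("EpisodeName") }
        /<FirstAired>/    { aired  = value("FirstAired") }
        /<SeasonNumber>/  { season = value("SeasonNumber") }
        /<EpisodeNumber>/ { number = value("EpisodeNumber") }
        # Flush one record per closing Episode tag, then reset.
        /<\/Episode>/ {
            print name   > (prefix ".actualEname.txt")
            print aired  > (prefix ".FAired.txt")
            print season > (prefix ".S.txt")
            print number > (prefix ".E.txt")
            name = aired = season = number = ""
        }
    ' "$1"
}
```

It would be called as split_episodes "$NewShowName.xml" "$NewShowName" from the show's directory. Unlike the original loop, awk truncates each output file when it first opens it rather than appending to an existing one.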

"BASH which is supposed to be designed around data manipulation and file operations."
Bash is designed for interactive command processing and linking programs together via pipes. Heavy data processing is not the design space of any *sh that I know of.
Python or Perl would be a much better choice for the problem space.

I just tried this:
echo "Parsing Downloaded information: $NewShowName.xml "
while read line
do
if [[ $line == \<\/Episode\> ]]; then
(( ++recordNumber ))
echo -ne "Building Record:$recordNumber ${actualEname:0:20} \r" 1>&2
echo "$EpisodeName" >> "$mythicalLibrarian/$NewShowName/$NewShowName.actualEname.txt"&
Ename=`echo "$actualEname" |sed 's/;.*//'`
echo "$EpisodeName" >> "$mythicalLibrarian/$NewShowName/$NewShowName.Ename.txt"&
echo "$FirstAired" >> "$mythicalLibrarian/$NewShowName/$NewShowName.FAired.txt"&
echo "$SeasonNumber" >> "$mythicalLibrarian/$NewShowName/$NewShowName.S.txt"&
echo "$EpisodeNumber" >> "$mythicalLibrarian/$NewShowName/$NewShowName.E.txt"&
EpisodeName=""
actualEname=""
FirstAired=""
SeasonNumber=""
EpisodeNumber=""
else
var=`echo $line |tr '<>' ' '|awk '{print $1}'`
value=`echo "$line"|sed -e s/'<'"$var"'>'// -e s/'<\/'"$var"'>'// -e s/'\&amp\;'/'\&'/ -e s/'\&quot\;'/'\"'/ -e s/'\&amp\;'/'\&'/ -e s/'\&ndash\;'/'-'/ -e s/'\&lt\;'/'\<'/ -e 's/'\&gt\;'/'\>'/' |tr -d '|\?\*\<\"\:\>\+\\\[\]\/'`
eval $var="'$value'"
fi
This took 43 seconds on the faster processor.

Holy cow, Dennis Williamson, it parses in less than half a second. It just flickers across the screen. It used to take 15 seconds, but now it's so quick that I can't even tell that it's happening.
These are the changes that Dennis Williamson suggested. I'm just posting them here.
echo "Parsing Downloaded information: $NewShowName.xml "
while read line
do
if [[ $line == \<\/Episode\> ]]; then
(( ++recordNumber ))
echo -ne "Building Record:$recordNumber ${actualEname:0:20} \r" 1>&2
echo "$actualEname" >> "$mythicalLibrarian/$NewShowName/$NewShowName.actualEname.txt"
echo "$Ename" >> "$mythicalLibrarian/$NewShowName/$NewShowName.Ename.txt"
echo "$FAired" >> "$mythicalLibrarian/$NewShowName/$NewShowName.FAired.txt"
echo "$SeasonNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.S.txt"
echo "$EpisodeNr" >> "$mythicalLibrarian/$NewShowName/$NewShowName.E.txt"
Ename=""
actualEname=""
FAired=""
SeasonNr=""
EpisodeNr=""
#Get actual show name
elif [[ $line == \<EpisodeName\>* ]]; then
line=${line/<\/EpisodeName>/}
line=${line/<EpisodeName>/}
line=${line/&lt;/}
line=${line/&gt;/}
line=${line/&quot;/}
line=${line/&amp;/\&}  # \& is a literal ampersand; bash 5.2+ expands a bare & to the match
line=${line/\|/}
line=${line/\?/}
line=${line/\*/}
line=${line/\:/}
line=${line/\+/}
line=${line/\\/}
line=${line/\//}
line=${line/\[/}
line=${line/\]/}
line=${line/\'/}
line=${line/\"/}
actualEname=${line/&ndash;/-}
Ename=${actualEname/;*/}
#Get OriginalAirDate
elif [[ $line == \<FirstAired\>* ]]; then
line=${line/<\/FirstAired>/}
line=${line/<FirstAired>/}
FAired=$line
#Get Season number
elif [[ $line == \<SeasonNumber\>* ]]; then
line=${line/<\/SeasonNumber>/}
line=${line/<SeasonNumber>/}
SeasonNr=$line
#Get Episode number
elif [[ $line == \<EpisodeNumber\>* ]]; then
line=${line/<\/EpisodeNumber>/}
line=${line/<EpisodeNumber>/}
EpisodeNr=$line
fi
done < "$mythicalLibrarian/$NewShowName/$NewShowName.xml"
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".actualEname.txt
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".Ename.txt
chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".FAired.txt
chmod 666 "$mythicalLibrarian"/"$NewShowName"/"$NewShowName".S.txt
chmod 666 "$mythicalLibrarian/$NewShowName/$NewShowName".E.txt
GotNewInformation=1
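One thing to watch in the loop above: the single-slash form ${line/x/} replaces only the first match, so a title containing the same character twice keeps the second one. The double-slash form replaces every match and is the counterpart of sed's g flag. A small illustration with a made-up title:

```shell
#!/bin/bash
name='Who? What? Where?'
first_only=${name/\?/}    # single slash: removes only the first '?'
all=${name//\?/}          # double slash: removes every '?'
echo "$first_only"
echo "$all"
```

If every occurrence should go, the substitutions in the loop need the double slash.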

Related

How to check if an attribute is present in an xml node using xmllint

I'm using bash and xmllint to check nodes in the following xml:
<?xml version="1.0" encoding="utf-8"?>
<output>
<document>
<sentence id="13">
<text>This is a test sentence.</text>
<entities>
<annotation id="3">
<grammar-form id="0" normal-form="THIS"/>
</annotation>
<annotation id="4">
<grammar-form id="0" normal-form="IS"/>
</annotation>
<annotation id="5">
<grammar-form id="0" normal-form="A"/>
</annotation>
<annotation id="6">
<grammar-form id="0" normal-form="TEST"/>
</annotation>
<annotation id="7">
<grammar-form id="0" normal-form="SENTENCE"/>
</annotation>
<annotation id="12">
<grammar-form id="0" normal-form="."/>
</annotation>
</entities>
</sentence>
</document>
</output>
How can I simply check that each grammar-form node has a normal-form attribute present? It doesn't matter what the attribute value is, I just need to check that it is present.
It's easier to select grammar-forms that don't have the attribute and see if you get any matches or not:
if xmllint --xpath '//grammar-form[not(@normal-form)]' input.xml 1>/dev/null 2>&1; then
echo "There are missing normal forms."
else
echo "There are no missing normal forms."
fi
In xpath mode, xmllint prints the matching nodes or, if nothing matches, exits with an error code of 10 and prints a message to that effect to standard error (unfortunately, the --noout option mentioned in the manpage to suppress output doesn't do anything in the version I'm testing with), hence the redirections.
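A variation on the same test, if you would rather have a number than an exit code: wrap the expression in the XPath count() function, which makes xmllint print the count. This is a sketch on a cut-down sample document, assuming xmllint is installed and supports --xpath.

```shell
#!/bin/bash
# Cut-down sample: the second grammar-form lacks the normal-form attribute.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<output>
  <annotation id="3"><grammar-form id="0" normal-form="THIS"/></annotation>
  <annotation id="4"><grammar-form id="0"/></annotation>
</output>
EOF

if command -v xmllint >/dev/null 2>&1; then
    # count() yields a number, so there is output even when nothing is missing.
    missing=$(xmllint --xpath 'count(//grammar-form[not(@normal-form)])' "$xml")
    echo "grammar-form nodes without normal-form: $missing"
else
    echo "xmllint not found" >&2
fi
rm -f "$xml"
```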

How to redirect long string with single/double quotes var in Rundeck?

Thank you for taking the time to read this question.
I have a Rundeck job with multiple steps. Basically, steps 1 and 2 fetch a long string which is enclosed in single quotes. Example:
'This is a long string.. and is also under "double quotes" '. -> This variable is stored in the following form: @option.mylongstring@
The third step of my Rundeck job is failing because I'm having issues with the single and double quotes in my string. I want to extract specific values from that long string.
My solution was to send the content of @option.mylongstring@ to a temp file, apply sed to convert single quotes into double quotes (sed "s/'/\"/g"), and from there extract the information that I need.
However, it seems that the redirection is not happening in Rundeck: echo @option.mylongstring@ &> $TEMPFILE does nothing and generates an empty file.
Has anyone faced the same issue?
Using an inline script works without problems. Let me share a job definition example:
<joblist>
<job>
<defaultTab>nodes</defaultTab>
<description></description>
<executionEnabled>true</executionEnabled>
<id>5e7123ce-c9b7-4bfa-a0e8-6484a9bd7c4f</id>
<loglevel>INFO</loglevel>
<name>LongStringExample</name>
<nodeFilterEditable>false</nodeFilterEditable>
<plugins />
<scheduleEnabled>true</scheduleEnabled>
<sequence keepgoing='false' strategy='node-first'>
<command>
<fileExtension>.sh</fileExtension>
<script><![CDATA[echo 'hello "world"' > myfile.txt]]></script>
<scriptargs />
<scriptinterpreter>/bin/bash</scriptinterpreter>
</command>
</sequence>
<uuid>5e7123ce-c9b7-4bfa-a0e8-6484a9bd7c4f</uuid>
</job>
</joblist>
Using an option:
<joblist>
<job>
<context>
<options preserveOrder='true'>
<option name='opt1' />
</options>
</context>
<defaultTab>nodes</defaultTab>
<description></description>
<executionEnabled>true</executionEnabled>
<id>22d7286f-7be9-4aaf-92ae-8e5bf5277d67</id>
<loglevel>INFO</loglevel>
<name>AnotherLongStringExample</name>
<nodeFilterEditable>false</nodeFilterEditable>
<plugins />
<scheduleEnabled>true</scheduleEnabled>
<sequence keepgoing='false' strategy='node-first'>
<command>
<fileExtension>.sh</fileExtension>
<script><![CDATA[echo 'this is another "@option.opt1@"' > another_file.txt]]></script>
<scriptargs />
<scriptinterpreter>/bin/bash</scriptinterpreter>
</command>
</sequence>
<uuid>22d7286f-7be9-4aaf-92ae-8e5bf5277d67</uuid>
</job>
</joblist>
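For a value that itself contains both kinds of quotes, one option is a here-document with a quoted delimiter, so the shell never interprets the quotes at all. This is a minimal sketch: it assumes Rundeck substitutes the option token textually before the script runs (verify this on your version), and a literal sample string stands in for the token here so the script can run outside Rundeck.

```shell
#!/bin/bash
# In a Rundeck inline script, the line between the EOF markers would be the
# option token; a literal sample stands in for it here (assumption).
TEMPFILE=$(mktemp)
cat > "$TEMPFILE" <<'EOF'
'This is a long string.. and is also under "double quotes" '
EOF

# Convert single quotes to double quotes, as described in the question.
converted=$(sed "s/'/\"/g" "$TEMPFILE")
echo "$converted"
rm -f "$TEMPFILE"
```

The quoted 'EOF' delimiter is what disables quote and variable processing inside the here-document; a value that contains a line consisting only of EOF would still end it early.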

Unix Script to split one big file into multiple files with two pairs of a tag each in a file with a naming convention on filename

I am writing a shell script to split one big file into multiple files, each containing two pairs of tags, and the names of the small files must follow a naming convention.
Example:-
Big File Name : abcdef123.xml
Contents:
<parent>
<child>
<code1><code1>
<text1><text1>
</child>
<child1>
<code2><code2>
<text2><text2>
</child1>
<child>
<code3><code3>
<text3><text3>
</child>
<child1>
<code4><code4>
<text4><text4>
</child1>
<child>
<code5><code5>
<text5><text5>
</child>
<child1>
<code6><code6>
<text6><text6>
</child1>
<child>
<code7><code7>
<text7><text7>
</child>
<child1>
<code8><code8>
<text8><text8>
</child1>
</parent>
The Unix shell script should split this big file into multiple files (each containing 2 pairs of <child> and <child1>) according to the following criteria, and take user input for the file name convention (the date with milliseconds can remain the same in all file names, but the variable 'j' should change):-
Criteria:-
Add header '<parent>' and tail '</parent>' to each file.
File name should be in the format 'UserinputstringMMDDYYYYHHMMSSMIL_n increment.xml' (where MIL is milliseconds and 'n increment' will be like 001, 002, 003....)
No two files should have the same filename.
Example Big File splits:-
file 1= stack_10132020134434789_001.xml
Contents :-
<parent>
<child>
<code1><code1>
<text1><text1>
</child>
<child1>
<code2><code2>
<text2><text2>
</child1>
<child>
<code3><code3>
<text3><text3>
</child>
<child1>
<code4><code4>
<text4><text4>
</child1>
</parent>
file 2= stack_10132020134434791_002.xml
Contents :-
<parent>
<child>
<code5><code5>
<text5><text5>
</child>
<child1>
<code6><code6>
<text6><text6>
</child1>
<child>
<code7><code7>
<text7><text7>
</child>
<child1>
<code8><code8>
<text8><text8>
</child1>
</parent>
Script I was trying :-
csplit -ksf part. src.xml
n=000
E.g. Enter beginning of file name :
User entered-> stack
read userinput
j=n+1
$date= date +%m%d%Y%H%M%S%3N
filename=$userinput$date_$j.xml
sample.xml:
<?xml version="1.0"?>
<parent>
<child>
<code1>aa</code1>
<text1>aat</text1>
</child>
<child1>
<code2>aa2</code2>
<text2>aat2</text2>
</child1>
<child>
<code3>bb</code3>
<text3>bbt</text3>
</child>
<child1>
<code4>bb2</code4>
<text4>bbt2</text4>
</child1>
<child>
<code5>cc</code5>
<text5>cct</text5>
</child>
<child1>
<code6>cc2</code6>
<text6>cct2</text6>
</child1>
<child>
<code7>dd</code7>
<text7>ddt</text7>
</child>
<child1>
<code8>dd2</code8>
<text8>ddt2</text8>
</child1>
</parent>
parser.sh
#!/bin/bash
PARENT='parent'
CHILD1='child'
CHILD2='child1'
INPUT_FILE='sample.xml'
NUM_OF_CHILDS=$(grep -c "<$CHILD1>" "$INPUT_FILE")
FILE_NUM=1
for i in $(seq 1 2 $NUM_OF_CHILDS); do
echo "-----------------------------------------------------"
echo "FILENAME_"$(date +%s%N)"_$FILE_NUM.xml"
echo "-----------------------------------------------------"
echo '<?xml version="1.0"?>'
echo '<'$PARENT'>'
xmllint --xpath "(//parent/$CHILD1[$i])" $INPUT_FILE
xmllint --xpath "(//parent/$CHILD2[$i])" $INPUT_FILE
xmllint --xpath "(//parent/$CHILD1[$(( i + 1 ))])" $INPUT_FILE
xmllint --xpath "(//parent/$CHILD2[$(( i + 1 ))])" $INPUT_FILE
echo '</'$PARENT'>'
FILE_NUM=$(( FILE_NUM + 1 ))
done
output:
-----------------------------------------------------
FILENAME_1603633647540475038_1.xml
-----------------------------------------------------
<?xml version="1.0"?>
<parent>
<child>
<code1>aa</code1>
<text1>aat</text1>
</child>
<child1>
<code2>aa2</code2>
<text2>aat2</text2>
</child1>
<child>
<code3>bb</code3>
<text3>bbt</text3>
</child>
<child1>
<code4>bb2</code4>
<text4>bbt2</text4>
</child1>
</parent>
-----------------------------------------------------
FILENAME_1603633647547254647_2.xml
-----------------------------------------------------
<?xml version="1.0"?>
<parent>
<child>
<code5>cc</code5>
<text5>cct</text5>
</child>
<child1>
<code6>cc2</code6>
<text6>cct2</text6>
</child1>
<child>
<code7>dd</code7>
<text7>ddt</text7>
</child>
<child1>
<code8>dd2</code8>
<text8>ddt2</text8>
</child1>
</parent>
It looks like you plan to consume the output files as XML, so indentation and newlines do not matter. If they do, experiment with xmllint's formatting options.
Other details, such as the file naming convention, are easy to change, so that part is up to you.
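To actually write the pieces to files with the requested naming convention, a plain line-oriented loop is one possibility. This is a sketch under stated assumptions, not a general XML splitter: it assumes every pair ends with a line containing </child1>, that <parent> and </parent> sit on their own lines, and that date supports %3N for milliseconds (a GNU date extension). split_pairs and its arguments are hypothetical names.

```shell
#!/bin/bash
# split_pairs INPUT PREFIX [PAIRS_PER_FILE]
# Writes PREFIX_<MMDDYYYYHHMMSSmmm>_001.xml, _002.xml, ... with a
# <parent> wrapper around every PAIRS_PER_FILE child/child1 pairs.
split_pairs() {
    local input=$1 prefix=$2 per_file=${3:-2}
    local stamp pairs=0 n=0 out="" line
    stamp=$(date +%m%d%Y%H%M%S%3N)   # taken once; the counter keeps names unique
    while IFS= read -r line; do
        [[ -z ${line//[[:space:]]/} ]] && continue             # skip blank lines
        case $line in
            *'<?xml'*|*'<parent>'*|*'</parent>'*) continue ;;  # re-added per file
        esac
        if [[ -z $out ]]; then                                 # start a new file
            n=$((n + 1))
            printf -v out '%s_%s_%03d.xml' "$prefix" "$stamp" "$n"
            echo '<parent>' > "$out"
        fi
        echo "$line" >> "$out"
        if [[ $line == *'</child1>'* ]]; then                  # a pair just ended
            pairs=$((pairs + 1))
            if (( pairs % per_file == 0 )); then
                echo '</parent>' >> "$out"
                out=""
            fi
        fi
    done < "$input"
    # Close a final, partially filled file.
    if [[ -n $out ]]; then echo '</parent>' >> "$out"; fi
}
```

With the user's input in userinput, it would be called as split_pairs abcdef123.xml "$userinput". The timestamp is identical in every name, which the question allows, and the zero-padded counter guarantees that no two files share a name.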

Android NDK Make File and Maven Build Issues

Let me just say that I'm pretty new to the Android NDK, so I've been working through Android's documentation on it. I've come across some issues when trying to use it with Maven (via plugins). My Maven plugin snippets are below, as well as my Android.mk file.
pom.xml (plugins portion):
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<compilerArgument>-AguiceAnnotationDatabasePackageName=my.package.name</compilerArgument>
<fork>true</fork>
</configuration>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>native-maven-plugin</artifactId>
<version>1.0-alpha-8</version>
<executions>
<execution>
<goals>
<goal>javah</goal>
</goals>
<phase>compile</phase>
<configuration>
<javahClassNames>
<javahClassName>my.package.name.MyClass</javahClassName>
</javahClassNames>
<javahVerbose>true</javahVerbose>
<javahPath>$(THE_JAVA_PATH)\bin\javah.exe</javahPath>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>com.simpligility.maven.plugins</groupId>
<artifactId>android-maven-plugin</artifactId>
<extensions>true</extensions>
<configuration>
<manifest>
<debuggable>true</debuggable>
<usesSdk>
<minSdkVersion>17</minSdkVersion>
<targetSdkVersion>18</targetSdkVersion>
</usesSdk>
</manifest>
<apk>
<debug>true</debug>
</apk>
<extractDuplicates>true</extractDuplicates>
<dex>
<coreLibrary>true</coreLibrary>
<jvmArguments>
<jvmArgument>-Xmx2048m</jvmArgument>
</jvmArguments>
</dex>
<nativeLibrariesDirectory>${to.ndk.libs}</nativeLibrariesDirectory>
<ndkOutputDirectory>${to.ndk.objs}/local</ndkOutputDirectory>
</configuration>
</plugin>
<plugin>
<groupId>com.simpligility.maven.plugins</groupId>
<artifactId>android-ndk-maven-plugin</artifactId>
<version>1.0.1-SNAPSHOT</version>
<executions>
<execution>
<goals>
<goal>ndk-build</goal>
</goals>
<configuration>
<target>${project.artifactId}</target>
<finalLibraryName>${project.artifactId}</finalLibraryName>
<ndkPath>$(THE_NDK_PATH)</ndkPath>
<makefile>jni/Android.mk</makefile>
<applicationMakefile>jni/Application.mk</applicationMakefile>
<architectures>${arch}</architectures>
<additionalCommandline>${ndk.args}</additionalCommandline>
<librariesOutputDirectory>${to.ndk.libs}</librariesOutputDirectory>
<objectsOutputDirectory>${to.ndk.objs}</objectsOutputDirectory>
<headerFilesDirectives>
<headerFilesDirective>
<directory>${basedir}/jni</directory>
<includes>
<include>**\/*.h</include>
</includes>
</headerFilesDirective>
<headerFilesDirective>
<directory>${project.build.directory}/native/javah</directory>
<includes>
<include>**\/*.h</include>
</includes>
</headerFilesDirective>
</headerFilesDirectives>
</configuration>
</execution>
</executions>
</plugin>
My directory structure is as follows: MyRoot -> jni -> (c/cpp files), plus the standard directory structure for the Java files. My Android.mk file is the following:
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
LOCAL_MODULE := my-lib
LOCAL_SRC_FILES := MyClass1.cpp \
MyFile1.c \
MyClass2.cpp
LOCAL_C_INCLUDES := $(LOCAL_PATH) \
$(LOCAL_PATH)/../target/native/javah
LOCAL_LDLIBS := -llog
LOCAL_CPP_FEATURES := rtti exceptions
LOCAL_CFLAGS += \
-D _NX_FEATURE_ATOMIC_C_PLUS_PLUS_11_ \
-D _NX_FEATURE_CAN_BUS_INTERFACE_ROUTER_ \
-D _NX_FEATURE_CAN_BUS_CREATED_BY_CONFIGURATION_ \
-D _NX_FEATURE_CAN_BUS_TC_SERVICE_ \
-D _NX_FEATURE_CRC_ \
-D _NX_FEATURE_EXCEPTIONS_ \
-D _NX_FEATURE_FILE_SUPPORT_CRT_ \
-D _NX_FEATURE_FLOAT_64_ \
-D _NX_FEATURE_LOG_ \
-D _NX_FEATURE_MUTEX_PTHREAD_ \
-D _NX_FEATURE_POSIX_SIGNAL_HANDLER_ \
-D _NX_FEATURE_RANDOM_ \
-D _NX_FEATURE_SECURITY_UNSET_INTENTIONALLY_ \
-D _NX_FEATURE_THREAD_ \
-D _NX_FEATURE_TRACE_STDIO_ \
-D _NX_FEATURE_TCP_IP_ \
-D _NX_FEATURE_XML_PARSER_ \
-D NX_CUSTOMER_FAR \
-D __STDC_FORMAT_MACROS \
-D HAVE_FTRUNCATE=1 \
-D HAVE_GETCWD=1 \
-D HAVE_GETPAGESIZE=1 \
-D HAVE_GETTIMEOFDAY=1 \
-D HAVE_INTTYPES_H=1 \
-D HAVE_MALLOC=1 \
-D HAVE_MEMCHR=1 \
-D HAVE_MEMMOVE=1 \
-D HAVE_MEMORY_H=1 \
-D HAVE_MEMSET=1 \
-D HAVE_MKDIR=1 \
-D HAVE_MMAP=1 \
-D HAVE_MUNMAP=1 \
-D HAVE_NETDB_H=1 \
-D HAVE_PTRDIFF_T=1 \
-D HAVE_RMDIR=1 \
-D HAVE_SELECT=1 \
-D HAVE_SOCKET=1 \
-D HAVE_STDDEF_H=1 \
-D HAVE_STDINT_H=1 \
-D HAVE_STDLIB_H=1 \
-D HAVE_STRINGS_H=1 \
-D HAVE_STRING_H=1 \
-D HAVE_STRPBRK=1 \
-D HAVE_STRRCHR=1 \
-D HAVE_STRSPN=1 \
-D HAVE_STRTOUL=1 \
-D HAVE_STRTOULL=1 \
-D HAVE_SYS_PARAM_H=1 \
-D HAVE_SYS_SOCKET_H=1 \
-D HAVE_SYS_STAT_H=1 \
-D HAVE_SYS_TIME_H=1 \
-D HAVE_SYS_TYPES_H=1 \
-D HAVE_TERMIOS_H=1 \
-D HAVE_UNISTD_H=1
LOCAL_STATIC_LIBRARIES := $(ANDROID_MAVEN_PLUGIN_LOCAL_STATIC_LIBRARIES)
LOCAL_SHARED_LIBRARIES := $(ANDROID_MAVEN_PLUGIN_LOCAL_SHARED_LIBRARIES)
include $(BUILD_SHARED_LIBRARY)
# Important: Must be the last import in order for Android Maven Plugins paths to work
include $(ANDROID_MAVEN_PLUGIN_MAKEFILE)
The error I'm getting is the following:
...\android-ndk-r10e\ndk-build.cmd -C ...\MyRoot APP_BUILD_SCRIPT=jni/Android.mk NDK_APPLICATION_MK=jni/Application.mk NDK_TOOLCHAIN=x86_64-4.9 APP_ABI=x86_64 V=1 -B NDK_DEBUG=1 NDK_LIBS_OUT=...\MyRoot\target\ndk-libs NDK_OUT=...\MyRoot\target\ndk-obj MyRoot
make.exe: *** No rule to make target `MyRoot'. Stop.
I'm not sure why MyRoot is even being used. When I run the command manually without the 'MyRoot', the build process starts, but it doesn't seem to use any of the include directories listed in my Android.mk file (LOCAL_C_INCLUDES).
It's probably something silly, but I'm at a loss here. Any help is appreciated.
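At the end of your command you have a floating 'MyRoot'. ndk-build hands its arguments to make, which treats a bare word like that as the name of a target to build; that is why you see "No rule to make target". I'm pretty sure removing it will resolve this error.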
...\android-ndk-r10e\ndk-build.cmd -C ...\MyRoot \
APP_BUILD_SCRIPT=jni/Android.mk NDK_APPLICATION_MK=jni/Application.mk \
NDK_TOOLCHAIN=x86_64-4.9 APP_ABI=x86_64 V=1 -B NDK_DEBUG=1 \
NDK_LIBS_OUT=...\MyRoot\target\ndk-libs \
NDK_OUT=...\MyRoot\target\ndk-obj MyRoot # this MyRoot is unnecessary
You will also find that many of these options (APP_ABI, NDK_TOOLCHAIN, etc.) are unnecessary if your Application.mk and directory structure are set up correctly.
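As a rough illustration of that last point, here is a dry-run sketch of what a reduced invocation might look like. It assumes that APP_ABI has been moved into jni/Application.mk and that the project keeps the default jni/Android.mk and jni/Application.mk layout, which ndk-build looks for on its own. The function names `write_application_mk` and `ndk_build_command` are hypothetical, and the command is only printed, not run.

```shell
#!/bin/bash
# Assumption: APP_ABI is declared in jni/Application.mk instead of being
# passed on the command line. $1 is the project root.
write_application_mk() {
    mkdir -p "$1/jni"
    printf 'APP_ABI := x86_64\n' > "$1/jni/Application.mk"
}

# Print (do not run) the reduced command for the project root in $1.
# Note that nothing follows the variable assignments: there is no
# trailing target name for make to trip over.
ndk_build_command() {
    printf 'ndk-build -C %s V=1 -B NDK_DEBUG=1\n' "$1"
}
```

Calling `ndk_build_command` with your project root prints a command line you can compare against the one the Maven plugin generates.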

Running shell scripts from Mule ESB

I have a flow set up to recognize when a file is dropped into a directory. Next I need to run a Bash script that processes the file (fairly intensive processing). The script grabs a PDF, creates a temporary directory, breaks the PDF into separate PNG files, runs an OCR processor against each image, converts the result to single-page PDFs, then merges all of the PDFs into a single multi-page PDF with the text layer from the OCR.
The problem is, the Bash script chokes after 10 concurrent transformations are triggered. Right now I have Mule ESB listening for new files, then triggering the script for each file, passing the appropriate parameters. Unfortunately, Mule has two tasks, listen -> trigger. We are going to have over 200 files in that directory that need to be queued for processing, preferably 5 at a time. How do I get Mule to limit the number of concurrent processes triggered?
Below is my initial draft Flow:
<?xml version="1.0" encoding="UTF-8"?>
<mule xmlns:cxf="http://www.mulesoft.org/schema/mule/cxf" xmlns:scripting="http://www.mulesoft.org/schema/mule/scripting" xmlns:http="http://www.mulesoft.org/schema/mule/http" xmlns:file="http://www.mulesoft.org/schema/mule/file" xmlns="http://www.mulesoft.org/schema/mule/core" xmlns:doc="http://www.mulesoft.org/schema/mule/documentation" xmlns:spring="http://www.springframework.org/schema/beans" version="CE-3.3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="
http://www.mulesoft.org/schema/mule/file http://www.mulesoft.org/schema/mule/file/current/mule-file.xsd
http://www.mulesoft.org/schema/mule/scripting http://www.mulesoft.org/schema/mule/scripting/current/mule-scripting.xsd
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-current.xsd
http://www.mulesoft.org/schema/mule/core http://www.mulesoft.org/schema/mule/core/current/mule.xsd
http://www.mulesoft.org/schema/mule/cxf http://www.mulesoft.org/schema/mule/cxf/current/mule-cxf.xsd
http://www.mulesoft.org/schema/mule/http http://www.mulesoft.org/schema/mule/http/current/mule-http.xsd ">
<configuration>
<default-threading-profile doThreading="false"/>
</configuration>
<queued-asynchronous-processing-strategy name="limitThreads" maxThreads="2"/>
<flow name="Poll_DirectoryFlow1" doc:name="Poll_DirectoryFlow1" processingStrategy="limitThreads">
<file:inbound-endpoint path="/home/administrator/Downloads/Input" responseTimeout="10000" doc:name="File" pollingFrequency="5000" fileAge="5000">
</file:inbound-endpoint>
<scripting:component doc:name="Script">
<scripting:script engine="Groovy">
<property key="originalFilename" value="#[header:originalFilename]"/>
<scripting:text><![CDATA[def filename = message.getInboundProperty('originalFilename')
println "$filename"
def directory = message.getInboundProperty('directory')
println "$directory"
"mkdir processed".execute()
def command = ["/home/administrator/ocr.sh", "$directory/$filename", "/home/administrator/Downloads/Output/$filename"]
println "$command"
def proc = "pwd".execute()
command.execute()
proc.waitFor()
println "${proc.in.text}"]]></scripting:text>
</scripting:script>
</scripting:component>
<echo-component doc:name="Echo"/>
</flow>
</mule>
Here is the actual Bash script (gives some hints on what tools we are using):
#!/bin/bash
#Setting variables
PARAM=$#
TMPDIR=./split
INFILENAME=${1##*/}
OUTFILENAME=${2##*/}
echo "1 is $1"
echo "2 is $2"
echo "infilename is $INFILENAME"
echo "outfilename is $OUTFILENAME"
#Logging I/O filenames
echo "infile: $1" >> error.log
echo "outfile: $2" >> error.log
#If the temporary directory doesn't exist, make it
if [ ! -d "$TMPDIR" ];
then
mkdir $TMPDIR
fi
#Check to see that the correct number of params have been passed.
if [[ $PARAM -lt 2 ]];
then
echo "Usage: $0 source.pdf output.pdf"
echo "output.pdf is the desired output file"
echo "source.pdf is a file to be OCR'd"
exit 1
fi
#Make sure the input file is a PDF
if [ "${1##*.}" == "pdf" ];
then
multilayer=false
#Check to see if the input file is a multi-layered pdf with searchable text
if grep -Fl "Font" "$1"; then multilayer=true; fi
#If it's not multi-layered, then perform the OCR
if [ "$multilayer" == "false" ];
then
mkdir $TMPDIR/"$INFILENAME/"
echo "making temporary directory $TMPDIR/$INFILENAME"
#Split the PDF into single-page PDFs in a temporary directory
pdftk "$1" burst output "$TMPDIR/$INFILENAME/pg_%04d.pdf"
echo "burst output to $TMPDIR/$INFILENAME/pg_%04d.pdf"
mv "$1" processed/
for files in "$TMPDIR/$INFILENAME/"*
do
echo "$files"
filename=$(basename "$files")
filename="${filename%.*}"
#Convert the pdf page into an image
gs -r300 -o "$TMPDIR/$INFILENAME/$filename.jpeg" -sDEVICE=jpeg "$TMPDIR/$INFILENAME/$filename.pdf"
#Perform the OCR against the image
tesseract "$TMPDIR/$INFILENAME/$filename.jpeg" "$TMPDIR/$INFILENAME/$filename" hocr
#Combine the OCR'd image and OCR'd text into a multi-layer PDF file of that page
hocr2pdf -i "$TMPDIR/$INFILENAME/$filename.jpeg" -o "$TMPDIR/$INFILENAME/$filename.pdf" < "$TMPDIR/$INFILENAME/$filename.html"
compressed="$filename-compressed.pdf"
#Compress the multi-layered PDF of the page
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$TMPDIR/$INFILENAME/$compressed" "$TMPDIR/$INFILENAME/$filename.pdf"
mv "$TMPDIR/$INFILENAME/$compressed" "$TMPDIR/$INFILENAME/$filename.pdf"
done
#Concatenate all of the multi-layer PDF pages into a single PDF file
pdftk "$TMPDIR/$INFILENAME/"*.pdf cat output "$OUTFILENAME"
compressed="$OUTFILENAME-compressed.pdf"
#Compress the multi-layered PDF
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$compressed" "$OUTFILENAME"
mv "$compressed" "$2"
rm -rf "$TMPDIR/$INFILENAME"
else
echo "The input file is multi-layered"
mv "$1" "$2"
fi
else
echo "Please enter a valid input pdf file"
exit 2
fi
An easy solution would be to drop the threading-profile-based strategy you were setting up and replace the scripting component with a pooled Java component configured like the following:
<pooled-component class="org.mule.PooledComponent">
<pooling-profile exhaustedAction="WHEN_EXHAUSTED_WAIT" maxActive="0" maxWait="-1" initialisationPolicy="INITIALISE_NONE"/>
</pooled-component>
You should place the invocation of your Bash script in that component. The Mule documentation on pooled components has the details.
@genjosanzo, you put me on the right track thinking about the processing strategy. Here is the solution that ended up working:
<?xml version="1.0" encoding="UTF-8"?>
<mule xmlns:cxf="http://www.mulesoft.org/schema/mule/cxf"
xmlns:scripting="http://www.mulesoft.org/schema/mule/scripting"
xmlns:http="http://www.mulesoft.org/schema/mule/http" xmlns:file="http://www.mulesoft.org/schema/mule/file"
xmlns="http://www.mulesoft.org/schema/mule/core" xmlns:doc="http://www.mulesoft.org/schema/mule/documentation"
xmlns:spring="http://www.springframework.org/schema/beans" version="CE-3.3.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://www.mulesoft.org/schema/mule/file http://www.mulesoft.org/schema/mule/file/current/mule-file.xsd
http://www.mulesoft.org/schema/mule/scripting http://www.mulesoft.org/schema/mule/scripting/current/mule-scripting.xsd
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-current.xsd
http://www.mulesoft.org/schema/mule/core http://www.mulesoft.org/schema/mule/core/current/mule.xsd
http://www.mulesoft.org/schema/mule/cxf http://www.mulesoft.org/schema/mule/cxf/current/mule-cxf.xsd
http://www.mulesoft.org/schema/mule/http http://www.mulesoft.org/schema/mule/http/current/mule-http.xsd ">
<queued-asynchronous-processing-strategy
name="limitThreads" maxThreads="7"
doc:name="Queued Asynchronous Processing Strategy" />
<flow name="Poll_DirectoryFlow1" doc:name="Poll_DirectoryFlow1"
processingStrategy="limitThreads">
<file:inbound-endpoint path="/home/administrator/Downloads/Input"
responseTimeout="10000" doc:name="File" pollingFrequency="60000"
fileAge="5000">
<file:filename-regex-filter pattern="^.*\.(pdf)$"
caseSensitive="false" />
</file:inbound-endpoint>
<scripting:component doc:name="Script">
<scripting:script engine="Groovy">
<scripting:text><![CDATA[def filename = message.getInboundProperty('originalFilename')
println "$filename"
def directory = message.getInboundProperty('directory')
println "$directory"
"mkdir processed".execute()
def command = ["/home/administrator/ocr.sh", "$directory/$filename", "/home/administrator/Downloads/Output/$filename"]
println "$command"
def cmd = command.execute()
cmd.waitFor()
println "$filename has completed processing"]]></scripting:text>
</scripting:script>
</scripting:component>
<echo-component doc:name="Echo"/>
</flow>
</mule>
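Outside of Mule, the same cap on concurrency can be approximated directly in the shell. This is only a sketch under stated assumptions: `process_queue` is a hypothetical helper that is not part of the original script, and it relies on the `-P` and `-0` options of GNU or BSD xargs (and `-maxdepth` / `-iname` in find), none of which are in POSIX.

```shell
#!/bin/bash
# Hypothetical helper: run a worker command once per PDF found directly in
# "$indir", keeping at most "$max_jobs" workers alive at any moment.
process_queue() {
    local indir=$1 max_jobs=$2
    shift 2   # whatever remains is the worker command and its fixed arguments

    # -print0 / -0 keep filenames containing spaces intact.
    # -n 1 hands each worker exactly one file; -P caps the concurrency.
    find "$indir" -maxdepth 1 -type f -iname '*.pdf' -print0 |
        xargs -0 -n 1 -P "$max_jobs" "$@"
}

# Example call (paths taken from the flow above; five workers at a time):
# process_queue /home/administrator/Downloads/Input 5 \
#     sh -c '/home/administrator/ocr.sh "$1" "/home/administrator/Downloads/Output/${1##*/}"' _
```

xargs starts a new worker only when one of the running workers exits, so the remaining files simply wait their turn. That is the same effect the `waitFor()` call has inside each Mule thread.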

Resources