I have a flow set up to recognize when a file is dropped into a directory. Next I need to run a Bash script that processes the file (fairly intensive processing). The script grabs a PDF, creates a temporary directory, breaks the PDF into separate PNG files, runs an OCR processor against each image, converts the result to single-page PDFs, then merges all of the PDFs into a single multi-page PDF with the text layer from the OCR.
The problem is that the Bash script chokes after 10 concurrent transformations are triggered. Right now Mule ESB listens for new files and triggers the script for each one, passing the appropriate parameters. Unfortunately, Mule does only those two things (listen, then trigger). We are going to have over 200 files in that directory that need to be queued for processing, preferably 5 at a time. How do I get Mule to limit the number of concurrent processes it triggers?
Below is my initial draft Flow:
<?xml version="1.0" encoding="UTF-8"?>
<mule xmlns:cxf="http://www.mulesoft.org/schema/mule/cxf" xmlns:scripting="http://www.mulesoft.org/schema/mule/scripting" xmlns:http="http://www.mulesoft.org/schema/mule/http" xmlns:file="http://www.mulesoft.org/schema/mule/file" xmlns="http://www.mulesoft.org/schema/mule/core" xmlns:doc="http://www.mulesoft.org/schema/mule/documentation" xmlns:spring="http://www.springframework.org/schema/beans" version="CE-3.3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="
http://www.mulesoft.org/schema/mule/file http://www.mulesoft.org/schema/mule/file/current/mule-file.xsd
http://www.mulesoft.org/schema/mule/scripting http://www.mulesoft.org/schema/mule/scripting/current/mule-scripting.xsd
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-current.xsd
http://www.mulesoft.org/schema/mule/core http://www.mulesoft.org/schema/mule/core/current/mule.xsd
http://www.mulesoft.org/schema/mule/cxf http://www.mulesoft.org/schema/mule/cxf/current/mule-cxf.xsd
http://www.mulesoft.org/schema/mule/http http://www.mulesoft.org/schema/mule/http/current/mule-http.xsd ">
<configuration>
<default-threading-profile doThreading="false"/>
</configuration>
<queued-asynchronous-processing-strategy name="limitThreads" maxThreads="2"/>
<flow name="Poll_DirectoryFlow1" doc:name="Poll_DirectoryFlow1" processingStrategy="limitThreads">
<file:inbound-endpoint path="/home/administrator/Downloads/Input" responseTimeout="10000" doc:name="File" pollingFrequency="5000" fileAge="5000">
</file:inbound-endpoint>
<scripting:component doc:name="Script">
<scripting:script engine="Groovy">
<property key="originalFilename" value="#[header:originalFilename]"/>
<scripting:text><![CDATA[def filename = message.getInboundProperty('originalFilename')
println "$filename"
def directory = message.getInboundProperty('directory')
println "$directory"
"mkdir processed".execute()
def command = ["/home/administrator/ocr.sh", "$directory/$filename", "/home/administrator/Downloads/Output/$filename"]
println "$command"
def proc = "pwd".execute()
command.execute()
proc.waitFor()
println "${proc.in.text}"]]></scripting:text>
</scripting:script>
</scripting:component>
<echo-component doc:name="Echo"/>
</flow>
</mule>
Here is the actual Bash script (gives some hints on what tools we are using):
#!/bin/bash
#Setting variables
PARAM=$#
TMPDIR=./split
INFILENAME=${1##*/}
OUTFILENAME=${2##*/}
echo "1 is $1"
echo "2 is $2"
echo "infilename is $INFILENAME"
echo "outfilename is $OUTFILENAME"
#Logging I/O filenames
echo "infile: $1" >> error.log
echo "outfile: $2" >> error.log
#If the temporary directory doesn't exist, make it
if [ ! -d "$TMPDIR" ];
then
mkdir $TMPDIR
fi
#Check to see that the correct number of params have been passed.
if [[ $PARAM -lt 2 ]];
then
echo "Usage: $0 source.pdf output.pdf"
echo "output.pdf is the desired output file"
echo "source.pdf is a file to be OCR'd"
exit 1
fi
#Make sure the input file is a PDF
if [ "${1##*.}" == "pdf" ];
then
multilayer=false
#Check to see if the input file is a multi-layered pdf with searchable text
if grep -Fl "Font" "$1"; then multilayer=true; fi
#If it's not multi-layered, then perform the OCR
if [ "$multilayer" == "false" ];
then
mkdir $TMPDIR/"$INFILENAME/"
echo "making temporary directory $TMPDIR/$INFILENAME"
#Split the PDF into one-page PDFs in a temporary directory
pdftk "$1" burst output "$TMPDIR/$INFILENAME/pg_%04d.pdf"
echo "burst output to $TMPDIR/$INFILENAME/pg_%04d.pdf"
mv "$1" processed/
for files in "$TMPDIR/$INFILENAME/"*
do
echo "$files"
filename=$(basename "$files")
filename="${filename%.*}"
#Convert the pdf page into an image
gs -r300 -o "$TMPDIR/$INFILENAME/$filename.jpeg" -sDEVICE=jpeg "$TMPDIR/$INFILENAME/$filename.pdf"
#Perform the OCR against the image
tesseract "$TMPDIR/$INFILENAME/$filename.jpeg" "$TMPDIR/$INFILENAME/$filename" hocr
#Combine the OCR'd image and OCR'd text into a multi-layer PDF file of that page
hocr2pdf -i "$TMPDIR/$INFILENAME/$filename.jpeg" -o "$TMPDIR/$INFILENAME/$filename.pdf" < "$TMPDIR/$INFILENAME/$filename.html"
compressed="$filename-compressed.pdf"
#Compress the multi-layered PDF of the page
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$TMPDIR/$INFILENAME/$compressed" "$TMPDIR/$INFILENAME/$filename.pdf"
mv "$TMPDIR/$INFILENAME/$compressed" "$TMPDIR/$INFILENAME/$filename.pdf"
done
#Concatenate all of the multiline PDF pages into a single PDF file
pdftk "$TMPDIR/$INFILENAME/"*.pdf cat output "$OUTFILENAME"
compressed="$OUTFILENAME-compressed.pdf"
#Compress the multi-layered PDF
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$compressed" "$OUTFILENAME"
mv "$compressed" "$2"
rm -rf "$TMPDIR/$INFILENAME"
else
echo "The input file is multi-layered"
mv "$1" "$2"
fi
else
echo "Please enter a valid input pdf file"
exit 2
fi
An easy solution would be to drop the threading-profile-based strategy you were setting up and replace the scripting component with a pooled Java component, configured like the following:
<pooled-component class="org.mule.PooledComponent">
<pooling-profile exhaustedAction="WHEN_EXHAUSTED_WAIT" maxActive="0" maxWait="-1" initialisationPolicy="INITIALISE_NONE"/>
</pooled-component>
Place the invocation of your Bash script in that component. The Mule documentation on pooled components has the details.
@genjosanzo, you put me on the right track by getting me to think about the processing strategy. Here is the solution that ended up working:
<?xml version="1.0" encoding="UTF-8"?>
<mule xmlns:cxf="http://www.mulesoft.org/schema/mule/cxf"
xmlns:scripting="http://www.mulesoft.org/schema/mule/scripting"
xmlns:http="http://www.mulesoft.org/schema/mule/http" xmlns:file="http://www.mulesoft.org/schema/mule/file"
xmlns="http://www.mulesoft.org/schema/mule/core" xmlns:doc="http://www.mulesoft.org/schema/mule/documentation"
xmlns:spring="http://www.springframework.org/schema/beans" version="CE-3.3.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://www.mulesoft.org/schema/mule/file http://www.mulesoft.org/schema/mule/file/current/mule-file.xsd
http://www.mulesoft.org/schema/mule/scripting http://www.mulesoft.org/schema/mule/scripting/current/mule-scripting.xsd
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-current.xsd
http://www.mulesoft.org/schema/mule/core http://www.mulesoft.org/schema/mule/core/current/mule.xsd
http://www.mulesoft.org/schema/mule/cxf http://www.mulesoft.org/schema/mule/cxf/current/mule-cxf.xsd
http://www.mulesoft.org/schema/mule/http http://www.mulesoft.org/schema/mule/http/current/mule-http.xsd ">
<queued-asynchronous-processing-strategy
name="limitThreads" maxThreads="7"
doc:name="Queued Asynchronous Processing Strategy" />
<flow name="Poll_DirectoryFlow1" doc:name="Poll_DirectoryFlow1"
processingStrategy="limitThreads">
<file:inbound-endpoint path="/home/administrator/Downloads/Input"
responseTimeout="10000" doc:name="File" pollingFrequency="60000"
fileAge="5000">
<file:filename-regex-filter pattern="^.*\.(pdf)$"
caseSensitive="false" />
</file:inbound-endpoint>
<scripting:component doc:name="Script">
<scripting:script engine="Groovy">
<scripting:text><![CDATA[def filename = message.getInboundProperty('originalFilename')
println "$filename"
def directory = message.getInboundProperty('directory')
println "$directory"
"mkdir processed".execute()
def command = ["/home/administrator/ocr.sh", "$directory/$filename", "/home/administrator/Downloads/Output/$filename"]
println "$command"
def cmd = command.execute()
cmd.waitFor()
println "$filename has completed processing"]]></scripting:text>
</scripting:script>
</scripting:component>
<echo-component doc:name="Echo"/>
</flow>
</mule>
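For comparison, the same kind of cap can be sketched outside Mule in plain Bash. This is not part of the solution above: `run_limited` is a hypothetical helper name, and the sketch assumes `find` and `xargs` implementations that support `-maxdepth`, `-print0`, `-0` and `-P` (GNU and BSD versions do; these are not in POSIX).

```shell
#!/bin/bash
# run_limited MAX_JOBS INPUT_DIR COMMAND [ARGS...]
# Runs COMMAND once per *.pdf file in INPUT_DIR, with the file path appended
# as the last argument, keeping at most MAX_JOBS processes alive at a time.
run_limited() {
    local max_jobs=$1 input_dir=$2
    shift 2
    # -print0 / -0 keep filenames containing spaces intact;
    # -n 1 passes one file per invocation; -P caps the parallelism.
    find "$input_dir" -maxdepth 1 -type f -name '*.pdf' -print0 |
        xargs -0 -n 1 -P "$max_jobs" "$@"
}
```

Since ocr.sh takes both an input and an output path, a real call would need a small wrapper script (hypothetical) that derives the output path from the input path and then invokes ocr.sh, for example `run_limited 5 /home/administrator/Downloads/Input ./ocr_wrapper.sh`.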
Using a bash script as a working example
#!/bin/bash
echo "\
<global_preferences>
...
</global_preferences>" >> global_prefs.xml
I tried the following that (IMHO) should have worked but didn't
(
^<config^>
^</config^>
) > test.xml
The following does work but is a PITA as the xml file is long
echo ^<config^> > test.xml
echo ^</config^> >> test.xml
I'm very new to shell scripting. I need to create an XML file that lists folders and the files within each folder. The requirement is something like the following.
For example: I have a folder called 'classes'
under this folder I have multiple files like '1.cls', '2.cls', '3.cls', etc.,
Similarly, I have other folders as well.
For example:
Folder name - 'Pages'
Files Under that folder name - '1.page', '2.page', '3.page' etc.,
Now my XML file should look something like below:
<types>
<members>1</members>
<members>2</members>
<members>3</members>
<name>classes</name>
</types>
<types>
<members>1</members>
<members>2</members>
<members>3</members>
<name>Pages</name>
</types>
Try the following in the directory where other required directories and their files are present:
#!/bin/bash
declare -r XML_FILE="Sample.xml"
[ -f ${XML_FILE} ] && : > ${XML_FILE}
for directory_name in $(ls -F . | grep '/' | sed 's|/||')
do
echo -e "<types>" >> ${XML_FILE}
dirfiles=$(ls -A ${directory_name})
if [ "${dirfiles}" ] ; then
for files in ${dirfiles}
do
echo -e "\t<members>${files/.*}</members>" >> ${XML_FILE}
done
fi
echo -e "\t<name>${directory_name}</name>" >> ${XML_FILE}
echo -e "</types>" >> ${XML_FILE}
done
Example
As per your example statements
mkdir -p classes Pages
touch classes/{1.cls,2.cls,3.cls}
touch Pages/{1.page,2.page,3.page}
Let the script be xmls.sh.
Execute the script: bash xmls.sh
View the output of Sample.xml: cat Sample.xml
<types>
<members>1</members>
<members>2</members>
<members>3</members>
<name>classes</name>
</types>
<types>
<members>1</members>
<members>2</members>
<members>3</members>
<name>Pages</name>
</types>
Now, Sample.xml file contains the above XML elements.
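A variant sketch of the same idea that uses shell globs instead of parsing `ls` output, so names containing spaces survive. `generate_types` is a hypothetical function name; the sketch assumes names contain no XML special characters, and unlike `ls -A` it skips hidden files.

```shell
#!/bin/bash
# generate_types [BASE_DIR]
# Prints one <types> block per subdirectory of BASE_DIR (default: current dir).
generate_types() {
    local base=${1:-.} dir file name
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue            # the glob matched nothing
        printf '<types>\n'
        for file in "$dir"*; do
            [ -f "$file" ] || continue       # skip nested directories
            name=${file##*/}                 # strip the directory part
            printf '\t<members>%s</members>\n' "${name%.*}"   # strip the extension
        done
        dir=${dir%/}                         # drop the trailing slash
        printf '\t<name>%s</name>\n' "${dir##*/}"
        printf '</types>\n'
    done
}
```

Called as `generate_types . > Sample.xml`, it overwrites Sample.xml in a single redirect rather than appending line by line.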
Sorry, I'm from Brazil and my English is not fluent.
I want to concatenate 20 files with the cat command from a shell script. However, when I run it from a script file, the contents of all the files are printed to the screen.
When I run it directly from the terminal, it works perfectly.
Here is my code:
#!/usr/bin/ksh
set -x -a
. /PROD/INCLUDE/include.prod
DATE=`date +'%Y%m%d%H%M%S'`
FINAL_NAME=$1
# check that all parameters are passed
if [ -z "$FINAL_NAME" ]; then
echo "Please pass the final name as parameter"
exit 1
fi
# concatenate files
cat $DIRFILE/AI6LM760_AI6_CF2_SLOTP01* $DIRFILE/AI6LM761_AI6_CF2_SLOTP02* $DIRFILE/AI6LM763_AI6_CF2_SLOTP04* \
$DIRFILE/AI6LM764_AI6_CF2_SLOTP05* $DIRFILE/AI6LM765_AI6_CF2_SLOTP06* $DIRFILE/AI6LM766_AI6_CF2_SLOTP07* \
$DIRFILE/AI6LM767_AI6_CF2_SLOTP08* $DIRFILE/AI6LM768_AI6_CF2_SLOTP09* $DIRFILE/AI6LM769_AI6_CF2_SLOTP10* \
$DIRFILE/AI6LM770_AI6_CF2_SLOTP11* $DIRFILE/AI6LM771_AI6_CF2_SLOTP12* $DIRFILE/AI6LM772_AI6_CF2_SLOTP13* \
$DIRFILE/AI6LM773_AI6_CF2_SLOTP14* $DIRFILE/AI6LM774_AI6_CF2_SLOTP15* $DIRFILE/AI6LM775_AI6_CF2_SLOTP16* \
$DIRFILE/AI6LM776_AI6_CF2_SLOTP17* $DIRFILE/AI6LM777_AI6_CF2_SLOTP18* $DIRFILE/AI6LM778_AI6_CF2_SLOTP19* \
$DIRFILE/AI6LM779_AI6_CF2_SLOTP20* > CF2_FINAL_TEMP
mv $DIRFILE/CF2_FINAL_TEMP $DIRFILE/$FINAL_NAME
I solved the problem by putting the cat block inside a function and redirecting the function's stdout to the final file.
Ex:
concatenate()
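The snippet above is cut off after the function name. As a minimal sketch of what such a function could look like (not the poster's actual code), the version below takes a directory and a list of filename prefixes, both of which are assumptions made for illustration. It avoids `local` so it should also run under ksh, which the original script uses.

```shell
#!/bin/bash
# concatenate DIR PREFIX...
# Prints every file in DIR whose name starts with one of the given prefixes
# to stdout, in the order the prefixes are listed.
concatenate() {
    dir=$1
    shift
    for prefix in "$@"; do
        cat "$dir/$prefix"*
    done
}
```

The redirect then goes on the call rather than on the cat command, for example `concatenate "$DIRFILE" AI6LM760_AI6_CF2_SLOTP01 AI6LM761_AI6_CF2_SLOTP02 > "$DIRFILE/$FINAL_NAME"`.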
I get an ambiguous redirect message even though the output file gets created.
my sh script
#!/bin/bash
# you can use read or VAR="$1" to setup these variables
SERVER_IP=
SERVER_PORT=
LANGUAGE_URL=
PROJECT_NAME=
while read f1
do
OUTPUTFIL=$f1
{
echo "<?xml version=\"1.0\" encoding=\"Shift-JIS\"?>"
echo "<flash_cfg>"
echo "<server ip=\"${SERVER_IP}\" port=\"${SERVER_PORT}\"/>"
echo "<language_url>${LANGUAGE_URL}</language_url>"
echo "<project_name>${PROJECT_NAME}</project_name>"
echo "</flash_cfg>"
} > ${OUTPUTFIL}
done < file
Content of "file":
out.xml
while running
:~/Documents$ bash shell.sh
shell.sh: line 22: ${OUTPUTFIL}: ambiguous redirect
The file out.xml is created however
No contradiction there, you have a loop.
So first you read a valid filename (out.xml), and create a file, then you're reading an invalid one, which creates the error message.
Example (you have an empty line in the input):
f=""
echo "Q" > ${f}
-bash: ${f}: ambiguous redirect
I'd use cat to simplify the code--see if this works any better:
while read f1
do
cat <<EOF >"$f1"
<?xml version="1.0" encoding="Shift-JIS"?>
<flash_cfg>
<server ip="${SERVER_IP}" port="${SERVER_PORT}"/>
<language_url>${LANGUAGE_URL}</language_url>
<project_name>${PROJECT_NAME}</project_name>
</flash_cfg>
EOF
done < file
That's known as a "here document" and lets you avoid all those echo statements and escaped quotes.
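Putting the two answers together, a minimal sketch: the loop below uses the here document and also skips blank lines, so an empty line in the list never reaches the redirect. `write_configs` is a hypothetical function name, and the SERVER_IP, SERVER_PORT, LANGUAGE_URL and PROJECT_NAME variables are assumed to be set by the caller as in the question.

```shell
#!/bin/bash
# write_configs LIST_FILE
# Writes one config file per non-empty line of LIST_FILE.
write_configs() {
    local list=$1 f1
    # "|| [ -n ... ]" also handles a last line with no trailing newline.
    while IFS= read -r f1 || [ -n "$f1" ]; do
        [ -n "$f1" ] || continue    # skip blank lines instead of redirecting to ""
        cat <<EOF >"$f1"
<?xml version="1.0" encoding="Shift-JIS"?>
<flash_cfg>
<server ip="${SERVER_IP}" port="${SERVER_PORT}"/>
<language_url>${LANGUAGE_URL}</language_url>
<project_name>${PROJECT_NAME}</project_name>
</flash_cfg>
EOF
    done < "$list"
}
```

The here-document body and its closing `EOF` stay at column zero even inside the indented function, because a plain `<<EOF` does not strip leading whitespace.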
I'm writing a script that compares files using the GNU version of the diff command. I need to ignore HTML comments (<!--) and any lines matching patterns that are supplied through an input file.
File wxy/a:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:p="http://www.springframework.org/schema/p"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.0.xsd">
some text here
<property name="loginUrl" value="http://localhost:15040/ab/ssoLogin"/>
<!--property name="cUrl" value="http://localhost:15040/ab/ssoLogin" /-->
</beans>
File xyz/a:
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:p="http://www.springframework.org/schema/p"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.0.xsd">
some text there
<property name="loginUrl" value="http://localhost:15045/ab/ssoLogin"/>
<!--property name="cUrl" value="http://localhost:15045/ab/ssoLogin" /-->
</beans>
Pattern input file: input.conf:
[a]
http://.*[:0-9]*/ab/ssoLogin
[some other file]
....
....
My script reads input.conf for the filename section [a] and writes its patterns to a temp file, lines_to_ignore. It then reads lines_to_ignore and appends each pattern to a variable, as below:
compare_file.sh
diff_ignore_options="-I \"\!--\"" # Ignore option for <!-- Comments
for iline in `cat lines_to_ignore`; do
diff_ignore_options=${diff_ignore_options}" -I \"$iline\""
echo "-----------------------------------------------------------"
diff -I "\!--" -I "$iline" wxy/a xyz/a
echo "-----------------------------------------------------------"
done
diff $diff_ignore_options wxy/a xyz/a
Now the output:
-----------------------------------------------------------
19c19
< some text here
---
> some text there
-----------------------------------------------------------
19,21c19,21
< some text here
< <property name="loginUrl" value="http://localhost:15040/ab/ssoLogin"/>
< <!--property name="cUrl" value="http://localhost:15040/ab/ssoLogin" /-->
---
> some text there
> <property name="loginUrl" value="http://localhost:15045/ab/ssoLogin"/>
> <!--property name="cUrl" value="http://localhost:15045/ab/ssoLogin" /-->
Why is the variable substitution in diff command not working?
diff $diff_ignore_options wxy/a xyz/a
I want to do it the variable way because I might have to match more than one pattern in some files.
The problem is the ! character, which the shell uses for history expansion. Furthermore, you're including escaped double-quote characters in your $diff_ignore_options variable; since the pattern you want to ignore doesn't include any " characters, you don't want that.
This should work (note the use of single quotes to avoid treating ! as a metacharacter):
diff_ignore_options='-I !--'
diff $diff_ignore_options this_file that_file
And you can then add more patterns like this:
diff_ignore_options="$diff_ignore_options -I foobar"
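Because the stated goal is several patterns read from a file, a Bash array is one way to keep each pattern as a single argument, even if it contains spaces. This is a sketch under that assumption, not part of the answer above; `build_ignore_opts` and the `diff_ignore_opts` array are hypothetical names.

```shell
#!/bin/bash
# build_ignore_opts PATTERN_FILE
# Fills the global array diff_ignore_opts with "-I '!--'" followed by one
# "-I pattern" pair per non-empty line of PATTERN_FILE.
build_ignore_opts() {
    local line
    diff_ignore_opts=(-I '!--')              # single quotes keep ! literal
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue           # skip blank lines
        diff_ignore_opts+=(-I "$line")       # one array element per pattern
    done < "$1"
}
```

Usage would be `build_ignore_opts lines_to_ignore` followed by `diff "${diff_ignore_opts[@]}" wxy/a xyz/a`. Keep in mind that GNU diff's `-I` only suppresses a hunk when every changed line in that hunk matches one of the patterns, so a hunk that also contains an unmatched changed line is still printed in full.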