I made an array of the names of the files that match a pattern:
lista=($(grep -El "<LastVisitedURL>.+</LastVisitedURL>.*<FavoriteTopic>0</FavoriteTopic>" *))
Now I want to delete from the file index.xml all the enclosing tags that contain a filename from the array.
for e in ${lista[*]}
do
sed '/\<TopicKey FileName=\"$e\"\>.*\<\/TopicKey\>/d' index.xml
done
The complete script is:
#! /bin/bash
#search xml files watched and no favorites.
lista=($(grep -El "<LastVisitedURL>.+</LastVisitedURL>.*<FavoriteTopic>0</FavoriteTopic>" *))
#declare -p lista
for e in ${lista[*]}
do
sed '/<TopicKey FileName=\"$e\">.*<\/TopicKey>/d' index.xml
done
The regex pattern doesn't work yet, and even once it does, using sed's -i option to edit index.xml in place would reload the index file as many times as the array has filenames, which is bad.
Any suggestions?
Here is an example using xmlstarlet in a shell:
% cat file.xml
<?xml version="1.0"?>
<root>
<foobar>aaa</foobar>
<LastVisitedURL>http://foo.bar/?a=1</LastVisitedURL>
<LastVisitedURL>http://foo.bar/?a=2</LastVisitedURL>
<LastVisitedURL>http://foo.bar/?a=3</LastVisitedURL>
</root>
Then, the command line:
% xmlstarlet edit --delete '//LastVisitedURL' file.xml
<?xml version="1.0"?>
<root>
<foobar>aaa</foobar>
</root>
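For the original task (dropping every TopicKey line whose FileName is in the array, without rereading index.xml once per filename), one option is to collect all the names first and filter the file in a single pass. This is only a minimal sketch: it assumes each TopicKey element sits on its own line, as the question's sed attempt implies, and it is plain text filtering rather than XML parsing. The function name remove_topic_keys is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: remove_topic_keys INDEX_FILE NAME...
# Deletes every line of INDEX_FILE containing <TopicKey FileName="NAME">.
remove_topic_keys() {
    local index=$1 patterns tmp name status
    shift
    patterns=$(mktemp) || return 1
    tmp=$(mktemp) || return 1

    # One fixed-string pattern per filename, so dots in names are not wildcards.
    for name in "$@"; do
        printf '<TopicKey FileName="%s">\n' "$name"
    done > "$patterns"

    # -F: fixed strings, -f: read patterns from a file, -v: keep non-matching lines.
    grep -vFf "$patterns" "$index" > "$tmp"
    status=$?

    # grep exits 1 when no line survives the filter; only 2 and above are errors.
    if [ "$status" -le 1 ]; then
        cat "$tmp" > "$index"
        status=0
    fi
    rm -f "$patterns" "$tmp"
    return "$status"
}
```

With the array from the question it would be called as remove_topic_keys index.xml "${lista[@]}", which reads and rewrites index.xml exactly once.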
Say I have hundreds of *.xml in /train/xml/, in the following format
# this is the content of /train/xml/RIGHT_NAME.xml
<annotation>
<path>/train/img/WRONG_NAME.jpg</path> # this is the WRONG_NAME
</annotation>
The file name WRONG_NAME in <path>...</path> should match that of the .xml file, so that it looks like this:
# this is the content of /train/xml/RIGHT_NAME.xml
<annotation>
<path>/train/img/RIGHT_NAME.jpg</path> # this is the **RIGHT_NAME**
</annotation>
One solution I can think of is to:
1. export all file names into a text file:
ls -1 *.xml > filenames.txt
which generates a file with the content:
RIGHT_NAME_0.xml
RIGHT_NAME_1.xml
...
2. then edit filenames.txt, so that it becomes:
# tab at beginning of each line
<path>/train/img/RIGHT_NAME_0.jpg</path>
<path>/train/img/RIGHT_NAME_1.jpg</path>
...
3. Then, replace the third line of nth .xml file with the nth line from filenames.txt.
Thus the question title.
I've hammered around with sed and awk but had no success. How should I do it (EDIT: on a macOS machine)? Also, is there a more elegant solution?
Thanks in advance for helping out!
---things I've tried (that didn't work out)---
# this replaces the fifth line with an empty string
for i in *.xml ; do perl -i.bak -pe 's/.*/$i/ if $.==5' RIGHT_NAME.xml ; done
# this appends the contents of filenames.txt after the third line
sed -i.bak -e '/\<path\>/r filenames.txt' RIGHT_NAME.xml
# also, trying to utilize the <path>...</path> pattern...
Untested:
for xml in *.xml; do
sed -E -i.bak "3s/[^/]*\.jpg/${xml%.xml}.jpg/" "$xml"
done
If ed is acceptable (it should be installed by default on a Mac):
#!/bin/sh
for file in ./*.xml; do
printf 'Processing %s\n' "$file"
f=${file%.*}; f=${f#*./}
printf '%s\n' H "g/<annotation>/;/<\/annotation>/\
s|^\([[:blank:]]*<path>.*/\)[^.]*\(.*</path>\)|\1${f}\2|" %p Q |
ed -s "$file" || break
done
This gives the desired result even if the path is longer, for example:
/foo/bar/baz/more/train/img/WRONG_NAME.jpg
It only edits the string inside the path tag that is inside the annotation tag.
Change Q to w if in-place editing is needed.
Remove the %p to silence the output.
Caveat:
ed is not an xml editor/parser.
Using GNU awk (which you can easily install on MacOS if it's not already present on your system) for "inplace" editing, gensub() and the 3rd arg to match():
$ cat tst.awk
match($0,"(^\\s*<path>.*/).*([.][^.]+</path>)",a) {
name = gensub("(.*/)?(.*)[.][^.]+$","\\2",1,FILENAME)
$0 = a[1] name a[2]
}
{ print }
$ head *.xml
==> RIGHT_NAME_1.xml <==
# this is the content of /train/xml/RIGHT_NAME_1.xml
<annotation>
<path>/train/img/WRONG_NAME.xml.jpg</path>
</annotation>
==> RIGHT_NAME_2.xml <==
# this is the content of /train/xml/RIGHT_NAME_2.xml
<annotation>
<path>/train/img/WRONG_NAME.xml.jpg</path>
</annotation>
$ awk -i inplace -f tst.awk *.xml
$ head *.xml
==> RIGHT_NAME_1.xml <==
# this is the content of /train/xml/RIGHT_NAME_1.xml
<annotation>
<path>/train/img/RIGHT_NAME_1.jpg</path>
</annotation>
==> RIGHT_NAME_2.xml <==
# this is the content of /train/xml/RIGHT_NAME_2.xml
<annotation>
<path>/train/img/RIGHT_NAME_2.jpg</path>
</annotation>
Just call it as awk -i inplace -f tst.awk /train/xml/* on your system. Note that the above just replaces the name in the <path> tag content wherever it occurs on its own line, so it will work whether that's the 3rd line in any given file or some other line. If you REALLY only want to do this for the 3rd line then just change match(... to FNR==3 && match(....
This might work for you (GNU sed & parallel):
parallel --dry sed -i '3s#[^/]*.jpg#{/.}.jpg#' {} ::: /train/xml/*.xml
In parallel, {} represents the file name with its path, whereas {/.} represents the file name without its path and extension.
Once the output from the above solution has been checked, the option --dry, which is short for --dry-run, can be removed.
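Since sed -i behaves differently in BSD (macOS) and GNU sed, a loop that writes to a temporary file sidesteps the in-place flag altogether. The following is a sketch under stated assumptions, not a tested drop-in: the <path> element is on one line, the file names contain no "|", "&" or backslash, and the helper name fix_paths is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: fix_paths DIR
# In each DIR/*.xml, replaces the file name inside <path>...</path> with the
# .xml file's own base name, keeping the directory part and the extension.
fix_paths() {
    local dir=$1 xml name tmp
    for xml in "$dir"/*.xml; do
        [ -e "$xml" ] || continue            # the glob matched nothing
        name=$(basename "$xml" .xml)
        tmp=$(mktemp) || return 1
        # \1 = "<path>" up to the last slash, \2 = extension plus closing tag.
        # Assumes $name contains no "|", "&" or backslash.
        if sed "s|\(<path>.*/\)[^/]*\(\.[^./<]*</path>\)|\1${name}\2|" "$xml" > "$tmp"; then
            cat "$tmp" > "$xml"
        fi
        rm -f "$tmp"
    done
}
```

Running fix_paths /train/xml would rewrite every file in that directory; copying the directory first gives the same safety net as the .bak suffix in the other answers.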
I have this code:
cat response_error.xml | sed -ne 's#\s*<[^>]*>\s*##gp' >> response_error.csv
but all the text that sed extracts from the XML runs together, for example:
084521AntonioCallas
I want to get this result:
084521,Antonio,Callas,
Is it possible?
I must write a script that collects the XML documents from the previous day, extracts only the data (without the <...> tags), and saves it to a CSV file like this: 084521,Antonio,Callas, with the values separated by commas. The XML looks like this:
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<GenerarInformeResponse xmlns="http://experian.servicios.CAIS">
<GenerarInformeResult>
<InformeResumen xmlns="http://experian.servicios.CAIS.V2">
<IdSuscriptor>084521</IdSuscriptor>
<ReferenciaConsulta>Antonio Callas 00000000</ReferenciaConsulta>
<Error>
<Codigo>0000</Codigo>
<Descripcion>OK</Descripcion>
</Error>
<Documento>
<TipoDocumento>
<Codigo>01</Codigo>
<Descripcion>NIF</Descripcion>
</TipoDocumento>
<NumeroDocumento>000000000</NumeroDocumento>
<PaisDocumento>
<Codigo>000</Codigo>
<Descripcion>ESPAÑA</Descripcion>
</PaisDocumento>
</Documento>
<Resumen>
<Nombre>
<Nombre1>XXX</Nombre1>
<Nombre2>XXX</Nombre2>
<ApellidosRazonSocial>XXX</ApellidosRazonSocial>
</Nombre>
<Direccion>
<Direccion>XXX</Direccion>
<NombreLocalidad>XXX</NombreLocalidad>
<CodigoLocalidad/>
<Provincia>
<Codigo>39</Codigo>
<Descripcion>XXX</Descripcion>
</Provincia>
<CodigoPostal>39012</CodigoPostal>
</Direccion>
<NumeroTotalOperacionesImpagadas>1</NumeroTotalOperacionesImpagadas>
<NumeroTotalCuotasImpagadas>0</NumeroTotalCuotasImpagadas>
<PeorSituacionPago>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPago>
<PeorSituacionPagoHistorica>
<Codigo>6</Codigo>
<Descripcion>XXX</Descripcion>
</PeorSituacionPagoHistorica>
<ImporteTotalImpagado>88.92</ImporteTotalImpagado>
<MaximoImporteImpagado>88.92</MaximoImporteImpagado>
<FechaMaximoImporteImpagado>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaMaximoImporteImpagado>
<FechaPeorSituaiconPagoHistorica>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaPeorSituaiconPagoHistorica>
<FechaAltaOperacionMasAntigua>
<DD>16</DD>
<MM>12</MM>
<AAAA>2015</AAAA>
</FechaAltaOperacionMasAntigua>
<FechaUltimaActualizacion>
<DD>27</DD>
<MM>03</MM>
<AAAA>2019</AAAA>
</FechaUltimaActualizacion>
</Resumen>
</InformeResumen>
</GenerarInformeResult>
</GenerarInformeResponse>
</s:Body>
</s:Envelope>
You can extract the IdSuscriptor using the following command:
xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml
And the ReferenciaConsulta using the following command:
xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml
To produce the desired IdSuscriptor,FirstName,LastName output I would use the following script:
id_suscriptor=$(xmllint --xpath '//*[local-name()="IdSuscriptor"]/text()' response_error.xml)
referencia_consulta=$(xmllint --xpath '//*[local-name()="ReferenciaConsulta"]/text()' response_error.xml)
first_name=$(echo "$referencia_consulta" | cut -d' ' -f1)
last_name=$(echo "$referencia_consulta" | cut -d' ' -f2)
echo "$id_suscriptor,$first_name,$last_name"
Note that this assumes the ReferenciaConsulta field will always contain a string starting with the first name and last name separated with a space.
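Where xmllint is not installed, the same two fields can be pulled out line by line with POSIX awk. This is only a sketch: it assumes each element opens and closes on a single line, as in the sample above, and it is not an XML parser. The helper name extract_csv is invented for this example.

```shell
#!/bin/bash
# Hypothetical helper: extract_csv FILE...
# Prints "IdSuscriptor,FirstName,LastName" for each ReferenciaConsulta line.
extract_csv() {
    awk '
        # Return the text between <tag> and </tag> on one line, or "".
        function inner(line, tag,    otag, s, e) {
            otag = "<" tag ">"
            s = index(line, otag)
            if (s == 0) return ""
            line = substr(line, s + length(otag))
            e = index(line, "</" tag ">")
            return e ? substr(line, 1, e - 1) : ""
        }
        # Remember the most recent subscriber id.
        /<IdSuscriptor>/ { id = inner($0, "IdSuscriptor") }
        # The first two space-separated words are taken as first and last name.
        /<ReferenciaConsulta>/ {
            split(inner($0, "ReferenciaConsulta"), word, " ")
            print id "," word[1] "," word[2]
        }
    ' "$@"
}
```

Because it accepts several files, extract_csv *.xml >> response_error.csv would append one CSV line per document.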
If you want to parse XML, use a dedicated XML parser like Saxon.
If you want to parse a strange text file with some funny unrelated angle brackets, try this:
#! /bin/sed -nf
s/^<IdSuscriptor>\([0-9]\+\)<\/IdSuscriptor>/\1,/
t match1
b next
: match1
h
b
: next
s/^<ReferenciaConsulta>\([^ ]\+\) \([^ ]\+\) [0-9]\+<\/ReferenciaConsulta>/\1,\2,/
t match2
b
: match2
H
g
s/\n//
p
Explanation
t jumps to match1 if the preceding s command did a replacement. Otherwise b jumps to next.
In case of a match h copies the matching string into the hold space and b stops the processing of the current line.
The second s command works the same way with the difference, that in case of no match b continues with the next line.
In case of the second match H appends the pattern space to the hold space, g copies the hold space to the pattern space, s removes the newline between the two matches and p prints the result.
Conclusion
If you do not know how to do it with sed don't try it. Try to learn a real programming language like Perl or JavaScript or Python. sed is a relic of bygone times.
If your data is in a file named d, try GNU sed:
sed -Ez 's/<[^>]*>//g;s/\n+|\s+/,/g;' d
I have a few hundred .txt files in a directory that have the following format:
<DOC>
<DOCNO> 33 </DOCNO>
<SOURCE> URL v.01 </SOURCE>
<URL> www.url.com/extension.html </URL>
<DATE> 2019/12/29/ </DATE>
<TIME> </TIME>
<AUTHOR> </AUTHOR>
<HEADLINE>
The title is here
</HEADLINE>
<TEXT>
Text that I want
</TEXT>
</DOC>
I would like to manipulate every single file so that it only contains the text between the <TEXT> and </TEXT> tags (i.e. Text that I want).
I have tried the following code but it does not seem to do what I need:
find /root/Desktop/data/data -type f | xargs sed -n '/<TEXT/,/<\/TEXT/p'
How can I do this using a bash script (preferably using sed)?
You want to remove everything but the text between TEXT tags from your files, right? This is how you do that.
find /root/Desktop/data/data -type f -execdir sed -i '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d' {} +
If there is at most one pair of the tags you are looking for and you don't want newline characters in the text:
#!/bin/bash
for file in /root/Desktop/data/data/*.txt; do
echo $(cat "$file" | tr -d '\n' | sed -nE 's/.*<TEXT>(.*)<\/TEXT>.*/\1/p')
done
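To keep the line breaks inside the text and rewrite each file without relying on sed -i, a range address can select the block and then drop the two tag lines. A minimal sketch, assuming <TEXT> and </TEXT> each sit on their own line as in the sample; keep_text_only is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: keep_text_only FILE...
# Rewrites each FILE so it holds only the lines between <TEXT> and </TEXT>.
keep_text_only() {
    local file tmp
    for file in "$@"; do
        tmp=$(mktemp) || return 1
        # Inside the <TEXT>..</TEXT> range: delete the tag lines, print the rest.
        if sed -n '/<TEXT>/,/<\/TEXT>/{/<TEXT>/d;/<\/TEXT>/d;p;}' "$file" > "$tmp"; then
            cat "$tmp" > "$file"
        fi
        rm -f "$tmp"
    done
}
```

It could be combined with find, for example by calling keep_text_only /root/Desktop/data/data/*.txt for a flat directory.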
I need to remove all the whitespace from lines that start with a pattern in a file.
I don't want to loop through lines. Is there any simple and quick solution?
For example
Input file:
<id xxx>dafd</id>
<r>31,31, 31</r>
<r> 0, 0,0 </r>
The output file needs to be:
<id xxx>dafd</id>
<r>31,31,31</r>
<r>0,0,0</r>
Like this?:
echo "<id xxx>dafd</id>
<r>31,31, 31</r>
<r> 0, 0,0 </r>" | sed -r '/<r>/s/ //g;'
<id xxx>dafd</id>
<r>31,31,31</r>
<r>0,0,0</r>
Explanation:
sed -r : use extended regular expressions
/<r>/ : Lines matching
s/ //g; : Substitute blanks with nothing, globally.
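The command above removes only space characters and matches <r> anywhere on the line. If tabs should go too and only lines that begin with <r> should be touched, a POSIX character class covers both. This is a small sketch; the function name strip_r_whitespace is an assumption for illustration.

```shell
#!/bin/bash
# Hypothetical helper: strip_r_whitespace [FILE...]
# On lines that start with <r> (optionally indented), removes all whitespace.
# Other lines pass through unchanged. Reads stdin when no FILE is given.
strip_r_whitespace() {
    sed '/^[[:space:]]*<r>/s/[[:space:]]//g' "$@"
}
```

Note that any indentation in front of <r> is removed as well, since it is whitespace on a matching line.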
Hi, you can do it with the script below. First create a file such as mytream.sh, add the lines below, change the file's permissions, and execute it:
vi mytream.sh
Now add the lines below:
#!/bin/bash
file_to_tream="yourfilename"
sed '/<r>/s/ //g' "$file_to_tream" > tmp.txt
mv tmp.txt "$file_to_tream"
Or, to use it with any file, change your script as below and pass the file name on the command line:
#!/bin/bash
sed '/<r>/s/ //g' "$1" > tmp.txt
mv tmp.txt "$1"
Now run it like this:
chmod +x mytream.sh
./mytream.sh yourfileName
Hope this will help you.
Suppose there is a file.txt that contains the text below:
ABC
EFG
XYZ
In another XML file, an empty-body target named compile is defined:
<project>
<compile>
.
.
.
start //from here till EOF
shell
script
xyz
</compile>
</project>
I need a shell script that fills in the content inside the defined target. After the script runs, the file should look like the output shown below. This should be done for every line in file.txt.
Output:-
<!-- ...preceding portions of input document... -->
<project>
<compile>
componentName="ABC"
componentName="EFG"
componentName="XYZ"
start
shell
script
xyz
</compile>
</project>
<!-- ...remaining portions of input document... -->
Use a proper XML parser. XMLStarlet is one tool fit for the job:
#!/bin/bash
# ^^^^- important, not /bin/sh
# read input file into an array
IFS=$'\n' read -r -d '' -a pieces <file.txt
# assemble target text based on expanding that array
printf -v text 'componentName=%s\n' "${pieces[@]}"
# Read input, changing all elements named "compile" in the default namespace
# ...to contain our target text.
xmlstarlet ed -u '//compile' -v "$text" <in.xml >out.xml
You can do what you are attempting (to some degree) with sed and a while read -r loop. For example, you can fill a temporary file with the contents of your xml file from line 1 to the <targettag> with
sed -n "1, /^${ttag}$/p" "$xfn" > "$ofn" ## fill output to ttag
(where xfn is your xml file name and ofn is your output file name)
You can then read all values from your text file and prepend componentName=" and append " with:
while read -r line; do ## read each line in ifn and concatenate
printf "%s%s\"\n" "$cmptag" "$line" >> "$ofn"
done <"$ifn"
(where ifn is your input file name)
And finally, you can write everything from the closing tag to the end of your xml file to your output file with:
sed -n "/^${ttag/</<[\/]}$/, \${p}" "$xfn" >> "$ofn"
(using parameter expansion with substring replacement to add the closing '/' to the beginning of <targettag>)
Putting it all together, you could do something like:
#!/bin/bash
ifn="f1"
xfn="f2.xml"
ofn="f3.xml"
ttag="${1:-<targettag>}" ## set target tag
cmptag="componentName=\"" ## set string to prepend
sed -n "1, /^${ttag}$/p" "$xfn" > "$ofn" ## fill output to ttag
while read -r line; do ## read each line in ifn and concatenate
printf "%s%s\"\n" "$cmptag" "$line" >> "$ofn"
done <"$ifn"
## fill output from closing tag to end
sed -n "/^${ttag/</<[\/]}$/, \${p}" "$xfn" >> "$ofn"
Input Files
$ cat f1
ABC
EFG
XYZ
$ cat f2.xml
<someschema>
<targettag>
</targettag>
</someschema>
Example Use/Output
$ fillxml.sh
$ cat f3.xml
<someschema>
<targettag>
componentName="ABC"
componentName="EFG"
componentName="XYZ"
</targettag>
</someschema>
(you can adjust the indentation to fit your needs)
Addition After Changes to Question
The changes needed to handle writing from start to end after adding the componentName="..." tags are simple. However, the commonality of the word start exemplifies why the answer by Charles encourages you to use an XML tool rather than a simple script. Why? If the word 'start' occurs anywhere else in your .xml file before your intended start, the script will fail by writing from the first occurrence of start to the end.
That said, if this is a simple one-off conversion and start doesn't occur otherwise, then the changes to the script to accomplish your desired output are easy:
#!/bin/bash
ifn="f1"
xfn="another.xml"
ofn="f3.xml"
ttag="${1:-<compile>}" ## set target tag
cmptag="componentName=\"" ## set string to prepend
sed -n "1, /^${ttag}$/p" "$xfn" > "$ofn" ## fill output to ttag
## read each line in ifn and concatenate
while read -r line || [ -n "$line" ]; do
printf "%s%s\"\n" "$cmptag" "$line" >> "$ofn"
done <"$ifn"
## fill output from 'start' to end
sed -n "/^start/, \${p}" "$xfn" >> "$ofn"
Input Files
$ cat f1
ABC
EFG
XYZ
$ cat another.xml
<project>
<compile>
start
shell
script
xyz
</compile>
</project>
Example Use/Output
$ cat f3.xml
<project>
<compile>
componentName="ABC"
componentName="EFG"
componentName="XYZ"
start
shell
script
xyz
</compile>
</project>
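The two sed passes and the read loop can also be collapsed into a single awk pass that copies the XML through and emits the new lines right after the opening tag, which avoids depending on the word start. This is a sketch under the same assumptions as the script above (the opening tag is on its own line, and this is text processing, not XML parsing); the function name fill_after_tag is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: fill_after_tag NAMES_FILE XML_FILE [TAG]
# Prints XML_FILE to stdout with one componentName="..." line per line of
# NAMES_FILE inserted directly after the line containing TAG.
fill_after_tag() {
    local names=$1 xml=$2 tag="${3:-<compile>}"
    awk -v tag="$tag" '
        # First file: remember each non-empty line as a component name.
        FILENAME == ARGV[1] { if ($0 != "") names[++n] = $0; next }
        # Second file: copy every line through unchanged...
        { print }
        # ...and right after the line holding the tag, emit one entry per name.
        index($0, tag) {
            for (i = 1; i <= n; i++) printf "componentName=\"%s\"\n", names[i]
        }
    ' "$names" "$xml"
}
```

For the files shown above, fill_after_tag f1 another.xml > f3.xml should produce the same f3.xml as the script.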
Look it over and let me know if you have questions.