xmllint: how to convert UTF-8 numeric references into characters

I'd like to convert UTF-8 numeric references into characters in the output from xmllint.
To reproduce:
$ wget http://il.srgssr.ch/integrationlayer/1.0/ue/rts/video/play/4727630.xml
$ xmllint --xpath "/Video/AssetMetadatas/AssetMetadata/title/text()" 4727630.xml && echo
Le jardin apprivois&#xE9; - Entre pierre et bois
I'd like the output to be:
Le jardin apprivoisé - Entre pierre et bois
I've read the man page and tried different options, but nothing worked.
If possible, I'd like to achieve this using options from xmllint or, if that is not possible, with another command-line tool that is commonly found in Linux distributions.
Thanks!

I understand that the question is a little outdated, but I came here from Google and want to share a possible answer for future visitors.
You need to change the XPath expression slightly and use the string() function instead of text():
$ wget http://il.srgssr.ch/integrationlayer/1.0/ue/rts/video/play/4727630.xml
$ xmllint --xpath "string(/Video/AssetMetadatas/AssetMetadata/title)" 4727630.xml
Le jardin apprivoisé - Entre pierre et bois

I have found another way which I think completely solves this problem. The trick is to use GNU recode to convert the output from HTML entities to UTF-8.
$ wget http://il.srgssr.ch/integrationlayer/1.0/ue/rts/video/play/4727630.xml
$ xmllint --xpath "/Video/AssetMetadatas/AssetMetadata/title/text()" 4727630.xml | recode html..utf8
Le jardin apprivoisé - Entre pierre et bois
recode can be installed using apt-get install recode.

How about good old sed and echo?
$ wget http://il.srgssr.ch/integrationlayer/1.0/ue/rts/video/play/4727630.xml
$ echo -e "$(xmllint --xpath "/Video/AssetMetadatas/AssetMetadata/title/text()" 4727630.xml | sed -E 's/&#x([0-9A-Fa-f]+);/\\u\1/g')"
Le jardin apprivoisé - Entre pierre et bois
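Where neither recode nor a different XPath expression is an option, the same decoding can be sketched in plain bash. This is only an illustration under stated assumptions: decode_refs and utf8_char are hypothetical helper names (not xmllint features), the input is read from stdin, and only numeric references of the form &#xHH; and &#DDD; are handled. Named entities such as &amp; are left alone.

```shell
# Hypothetical helpers, bash only. utf8_char prints the UTF-8 bytes of one
# code point (decimal argument); decode_refs rewrites stdin line by line.
utf8_char() {
  local cp=$1 b
  local -a bytes
  if   (( cp < 0x80 ));    then bytes=( "$cp" )
  elif (( cp < 0x800 ));   then bytes=( $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) )) )
  elif (( cp < 0x10000 )); then bytes=( $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) )) )
  else                          bytes=( $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) )) )
  fi
  for b in "${bytes[@]}"; do
    # Build the escape \xHH, then let %b turn it into the raw byte.
    printf '%b' "$(printf '\\x%02x' "$b")"
  done
}

decode_refs() {
  local re='&#(x[0-9A-Fa-f]+|[0-9]+);' line out ref
  while IFS= read -r line || [[ -n $line ]]; do
    out=
    while [[ $line =~ $re ]]; do
      ref=${BASH_REMATCH[1]}
      out+=${line%%"${BASH_REMATCH[0]}"*}   # text before the reference
      line=${line#*"${BASH_REMATCH[0]}"}    # text after the reference
      if [[ $ref == x* ]]; then
        out+=$(utf8_char $(( 16#${ref#x} )))  # hexadecimal reference
      else
        out+=$(utf8_char $(( 10#$ref )))      # decimal reference
      fi
    done
    printf '%s\n' "$out$line"
  done
}
```

It would be used at the end of the pipeline, for example: xmllint --xpath "/Video/AssetMetadatas/AssetMetadata/title/text()" 4727630.xml | decode_refs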


Sed command to uppercase text between two specific strings

I want to parse a file and replace the text between "::" and ":::" with the text already there, just now capitalized.
I've tried using this command:
sed 's/\(::\)\(.*\)\(:::\)/\1\U\2\E\3/' filename
but the output just puts a U at the beginning and an E at the end of the string I want capitalized.
Works for me, which makes me think you may not be on Linux?
echo "This is :: some sample text ::: to test uppercasing" | sed 's/\(::\)\(.*\)\(:::\)/\1\U\2\E\3/'
This is :: SOME SAMPLE TEXT ::: to test uppercasing
If Perl is an option, you can say something like:
echo "This is :: some sample text ::: to test uppercasing" | perl -pe 's/(::)(.*)(:::)/$1\U$2\E$3/'
This is :: SOME SAMPLE TEXT ::: to test uppercasing
Here is a somewhat less cryptic solution with gawk:
gawk '{match($0,/::.*:::/,a); gsub(/::.*:::/,toupper(a[0]))}1' input
match() finds the desired string and stores it in a[0]; gsub() then replaces it with its uppercase form, produced by toupper().
You are pretty close.
On Mac OS X, you will need to install GNU sed, because the feature you are using - \U - is a GNU extension.
So, start by installing it:
▶ brew install gnu-sed
Then I normally stick in some code like this somewhere:
shopt -s expand_aliases
alias sed='/usr/local/bin/gsed'
And then your GNU sed will work.
Finally, I would simplify that code as:
▶ sed -E 's/(::)(.*)(::)/\1\U\2\E\3/' <<< "foo::bar::baz"
foo::BAR::baz
Note that -E gives you Extended Regular Expressions and a cleaner syntax when you are doing captures.
This might work for you (GNU sed):
sed 's/::[^:]*:::/\U&/' file
or perhaps:
sed 's/::[^:]*:::/\n&\n/;h;y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/;G;s/.*\n\(.*\)\n.*\n\(.*\)\n.*\n/\2\1/' file
The second command uses sed's native y (transliterate) command, pattern matching, and a copy of the line kept in the hold space.
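For systems without GNU sed or gawk, the same idea can be sketched with only POSIX awk features (match, substr, toupper). upper_between is a hypothetical function name, and the sketch assumes at most one "::" ... ":::" span per line, with no colon inside the span.

```shell
# Hypothetical helper: uppercase the first "::text:::" span on each line.
# Uses only POSIX awk functions, so it does not depend on GNU extensions.
upper_between() {
  awk '{
    if (match($0, /::[^:]*:::/)) {
      # RSTART and RLENGTH describe the matched span; rebuild the line around it.
      $0 = substr($0, 1, RSTART - 1) \
           toupper(substr($0, RSTART, RLENGTH)) \
           substr($0, RSTART + RLENGTH)
    }
    print
  }' "$@"
}
```

With no arguments it reads stdin; with file names it processes those files, as in upper_between filename.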

Extracting string from Substring in shell script

I have a line like this:
<option value="bo">Tibetan Standard, Tibetan, Central</option>
I want an output like this:
bo Tibetan Standard, Tibetan, Central
When I try to do this with sed:
sed -r 's/.*value="(\S+).*">(\S+)<.*/\1 \2/'
It gives only:
bo Tibetan
Can anyone help me?
Thanks in advance
\S+ cannot match the spaces inside the label, so both groups need .* instead. The following modification to your original sed command should work:
sed -r 's_.*value="(.*)">(.*)</option>_\1 \2_'
The following example:
sed -r 's_.*value="(.*)">(.*)</option>_\1 \2_' <<< '<option value="bo">Tibetan Standard, Tibetan, Central</option>'
Prints the desired output:
bo Tibetan Standard, Tibetan, Central
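Because .* is greedy, a slightly more defensive variant uses negated character classes so that each group stops at its own delimiter. The following is a sketch only: option_pairs is a hypothetical name, sed -E is assumed to be available (it is in GNU and BSD sed), and each input line is assumed to hold a single option element.

```shell
# Hypothetical helper: print "value label" for each line holding an <option>.
# [^"]* stops at the closing quote, [^<]* stops at the closing tag.
option_pairs() {
  sed -n -E 's/.*value="([^"]*)"[^>]*>([^<]*)<.*/\1 \2/p' "$@"
}
```

Lines without an option element print nothing, since -n suppresses default output and the p flag prints only lines where the substitution succeeded.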

Using Sed to pull out base URL of full website

I'm looking to cat a file containing something like this:
http://www.site1.com/d23bdbd0fbc517d34, r N 4
https://www.site2.com/file/d23bdbd0fbc517d34, X
http://www.site3.com/file/d23bdbd0fbc517d34
https://www.site4.edu/site/d23bdbd0fbc517d34
and I need to use sed to get this kind of output:
www.site1.com
www.site2.com
www.site3.com
www.site4.edu
Help! I can't get it fully working right. Technically I'm using sed.exe for Windows but it's probably very similar.
$ cat file.txt
http://www.site1.com/d23bdbd0fbc517d34, r N 4
https://www.site2.com/file/d23bdbd0fbc517d34, X
http://www.site3.com/file/d23bdbd0fbc517d34
https://www.site4.edu/site/d23bdbd0fbc517d34
$ sed -r 's#.*//([^ /]+).*#\1#g' file.txt
www.site1.com
www.site2.com
www.site3.com
www.site4.edu
If you don't have the -r switch:
sed 's#.*//\([^ /]\+\)[/ ].*#\1#g' file.txt
Moreover, under Windows, IIRC, use double quotes instead of single quotes.
So maybe:
sed.exe "s#.*//\([^ /]\+\)[/ ].*#\1#g" file.txt
Another variant is:
sed '\#.*www[.]\([^/]*\).*# s::\1:'
will display
site1.com
site2.com
site3.com
site4.edu
tested with
#ThinkPad-T420:~$ sed --version
GNU sed version 4.2.1
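Where a shell is available (which may not be the case with sed.exe on Windows), the host can also be extracted without sed by using parameter expansion. This is a sketch under that assumption; url_host is a hypothetical function name, and it keeps the www. prefix as in the requested output.

```shell
# Hypothetical helper: read URLs from stdin, print only the host part.
url_host() {
  local url host
  while IFS= read -r url; do
    host=${url#*://}       # drop the scheme and the "://" separator
    host=${host%%[/ ,]*}   # drop everything from the first "/", space or comma
    printf '%s\n' "$host"
  done
}
```

It would be called as url_host < file.txt.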

Unaccent string in bash script (RHEL)

On Debian-based distributions, there is a utility called unaccent which can be used to remove accents from accented letters in a text.
I was looking for a package containing this on Redhat distros, but the only one I found was unac available for Mandriva only.
I tried to use iconv but it seems to not support my case.
What is the best lightweight approach that is easily usable in a bash script?
Are there any secret options to iconv that allow this?
You can use the -c option in iconv, which discards characters that cannot be converted, to remove non-ASCII characters:
$ echo 'été' | iconv -c -f utf8 -t ascii
t
If you just want to remove the accent:
$ echo 'été' | iconv -f utf8 -t ascii//TRANSLIT
ete
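The output of //TRANSLIT can vary by platform and locale, so where it does not give the desired result, an explicit mapping table in sed is one possible fallback. The sketch below is an assumption-laden illustration: strip_accents is a hypothetical name, the input is assumed to be precomposed UTF-8, and only the Latin letters listed in the table are handled.

```shell
# Hypothetical helper: replace a fixed set of accented Latin letters with
# their unaccented form. Alternation (a|b) is used instead of bracket
# expressions so the match also works byte-wise in the C locale.
strip_accents() {
  sed -E \
    -e 's/à|á|â|ã|ä|å/a/g' -e 's/À|Á|Â|Ã|Ä|Å/A/g' \
    -e 's/ç/c/g'           -e 's/Ç/C/g' \
    -e 's/è|é|ê|ë/e/g'     -e 's/È|É|Ê|Ë/E/g' \
    -e 's/ì|í|î|ï/i/g'     -e 's/Ì|Í|Î|Ï/I/g' \
    -e 's/ñ/n/g'           -e 's/Ñ/N/g' \
    -e 's/ò|ó|ô|õ|ö/o/g'   -e 's/Ò|Ó|Ô|Õ|Ö/O/g' \
    -e 's/ù|ú|û|ü/u/g'     -e 's/Ù|Ú|Û|Ü/U/g' \
    -e 's/ý|ÿ/y/g'         -e 's/Ý/Y/g' \
    "$@"
}
```

Unlike iconv -c, characters outside the table pass through unchanged rather than being dropped.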

Case-insensitive search and replace with sed

I'm trying to use SED to extract text from a log file. I can do a search-and-replace without too much trouble:
sed 's/foo/bar/' mylog.txt
However, I want to make the search case-insensitive. From what I've googled, it looks like appending i to the end of the command should work:
sed 's/foo/bar/i' mylog.txt
However, this gives me an error message:
sed: 1: "s/foo/bar/i": bad flag in substitute command: 'i'
What's going wrong here, and how do I fix it?
Update: Starting with macOS Big Sur (11.0), sed now does support the I flag for case-insensitive matching, so the command in the question should now work (BSD sed doesn't report its version, but you can go by the date at the bottom of the man page, which should be March 27, 2017 or more recent); a simple example:
# BSD sed on macOS Big Sur and above (and GNU sed, the default on Linux)
$ sed 's/ö/#/I' <<<'FÖO'
F#O # `I` matched the uppercase Ö correctly against its lowercase counterpart
Note: I (uppercase) is the documented form of the flag, but i works as well.
Similarly, starting with macOS Big Sur (11.0) awk now is locale-aware (awk --version should report 20200816 or more recent):
# BSD awk on macOS Big Sur and above (and GNU awk, the default on Linux)
$ awk 'tolower($0)' <<<'FÖO'
föo # non-ASCII character Ö was properly lowercased
The following applies to macOS up to Catalina (10.15):
To be clear: On macOS, sed - which is the BSD implementation - does NOT support case-insensitive matching - hard to believe, but true. The formerly accepted answer, which itself shows a GNU sed command, gained that status because of the perl-based solution mentioned in the comments.
To make that Perl solution work with foreign characters as well, via UTF-8, use something like:
perl -C -Mutf8 -pe 's/öœ/oo/i' <<< "FÖŒ" # -> "Foo"
-C turns on UTF-8 support for streams and files, assuming the current locale is UTF-8-based.
-Mutf8 tells Perl to interpret the source code as UTF-8 (in this case, the string passed to -pe) - this is the shorter equivalent of the more verbose -e 'use utf8;'. Thanks, Mark Reed.
(Note that using awk is not an option either, as awk on macOS (i.e., BWK awk and BSD awk) appears to be completely unaware of locales altogether - its tolower() and toupper() functions ignore foreign characters (and sub() / gsub() don't have case-insensitivity flags to begin with).)
A note on the relationship of sed and awk to the POSIX standard:
BSD sed and awk limit their functionality mostly to what the POSIX sed and POSIX awk specs mandate, whereas their GNU counterparts implement many more extensions.
Editor's note: This solution doesn't work on macOS (out of the box), because it only applies to GNU sed, whereas macOS comes with BSD sed.
Capitalize the 'I'.
sed 's/foo/bar/I' file
Another workaround for sed on Mac OS X is to install gsed from MacPorts or Homebrew and then create the alias sed='gsed'.
If you are doing pattern matching first, e.g.,
/pattern/s/xx/yy/g
then you want to put the I after the pattern:
/pattern/Is/xx/yy/g
Example:
echo Fred | sed '/fred/Is//willma/g'
returns willma; without the I, it returns the string untouched (Fred).
The sed FAQ addresses the closely related case-insensitive search. It points out that a) many versions of sed support a flag for it and b) it's awkward to do in sed, you should rather use awk or Perl.
But to do it in POSIX sed, they suggest three options (adapted for substitution here):
Convert to uppercase and store the original line in the hold space; this won't work for substitutions, though, as the original content will be restored before printing, so it's only good for inserting or appending lines based on a case-insensitive match.
If the possibilities are limited to FOO, Foo and foo, they can be covered by:
s/FOO/bar/;s/[Ff]oo/bar/
To search for all possible matches, one can use bracket expressions for each character:
s/[Ff][Oo][Oo]/bar/
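Writing the bracket expressions by hand gets tedious for longer words, so they can be generated. The following is a sketch, not part of the sed FAQ: ci_pattern is a hypothetical helper name, and it assumes the search word consists of plain ASCII letters and digits with no regex metacharacters.

```shell
# Hypothetical helper: turn a word into a case-insensitive POSIX pattern,
# e.g. foo becomes [Ff][Oo][Oo]. Non-letters are copied unchanged.
ci_pattern() {
  printf '%s\n' "$1" | awk '{
    out = ""
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c ~ /[A-Za-z]/)
        out = out "[" toupper(c) tolower(c) "]"
      else
        out = out c
    }
    print out
  }'
}
```

The result can then be spliced into a sed command that needs no flag support, for example: sed "s/$(ci_pattern foo)/bar/" mylog.txt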
The Mac version of sed seems a bit limited. One way to work around this is to use a Linux container (via Docker) which has a usable version of sed:
cat your_file.txt | docker run -i busybox /bin/sed -r 's/[0-9]{4}/****/Ig'
Use the following to replace all occurrences:
sed 's/foo/bar/gI' mylog.txt
I had a similar need, and came up with this:
This command simply finds all the files:
grep -i -l -r foo ./*
This one excludes this_shell.sh (in case you put the command in a script called this_shell.sh), tees the output to the console to show what happened, and then runs sed on each file name found to replace the text foo with bar:
grep -i -l -r --exclude "this_shell.sh" foo ./* | tee /dev/fd/2 | while read -r x; do sed -b -i 's/foo/bar/gi' "$x"; done
I chose this method because I didn't like having the timestamps changed on files that were not modified. Feeding sed the grep result means only the files containing the target text are touched (which likely improves performance as well).
Be sure to back up your files and test before using. I am not sure it works in every environment for files with embedded spaces.
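To address the concern about embedded spaces, the file list can be passed NUL-delimited instead of line by line. This is a sketch that assumes GNU grep, GNU xargs and GNU sed (the -Z, -0, -r and -i options and the I flag used here are GNU features); replace_foo_ci is a hypothetical function name, and the foo and bar strings are the literals from the question.

```shell
# Hypothetical helper, GNU tools assumed: edit only the files under "$1"
# that contain foo in any letter case, leaving all other files untouched.
replace_foo_ci() {
  local dir=$1
  # -Z ends each file name with a NUL byte; xargs -0 reads that format,
  # so names containing spaces or newlines are passed through intact.
  # xargs -r skips running sed when grep finds no files at all.
  grep -rliZ -- 'foo' "$dir" | xargs -0 -r sed -i 's/foo/bar/gI'
}
```

It would be run as replace_foo_ci ./logs, after backing up the directory.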
The following should be fine:
sed -i 's/foo/bar/gi' mylog.txt
