How to grep for exact hexadecimal value of characters - bash

I am trying to grep for the hexadecimal value of a range of UTF-8 encoded characters and I only want just that specific range of characters to be returned.
I currently have this:
grep -P -n "[\xB9-\xBF]" $str_st_location >> output_st.txt
But this returns every character that has any of those hex values in it hex representation i.e it returns 00B9 - FFB9 as long as the B9 is present.
Is there a way I can specify using grep that I only want the exact/specific hex value range I search for?
Sample Input:
STRING_OPEN
Open
æ–­å¼€
Ouvert
Abierto
Открыто
Abrir
Now using my grep statement, it should return the 3rd line and 6th line, but it also includes some text in my file that are Russian and Chinese because the range for languages include the hex values I'm searching for like these:
断开
Открыто
I can't give out more sample input unfortunately as it's work related.
EDIT: Actually the below code snippet worked!
grep -P -n "[\x{00B9}-\x{00BF}]" $str_st_location > output_st.txt
It found all the corrupted characters and there were no false positives. The only issue now is that the lines with the corrupted characters automatically gets "uncorrupted" i.e when I open the file, grep's output is the corrected version of the corrupted characters. For example, it finds æ–­å¼€ and in the text file, it's show as 断开.

Since you're using -P, you're probably using GNU grep, because that is a GNU grep extension. Your command works using GNU grep 2.21 with pcre 8.37 and a UTF-8 locale, however there have been bugs in the past with multi-byte characters and character ranges. You're probably using an older version, or it is possible that your locale is set to one that uses single-byte characters.
If you don't want to upgrade, it is possible to match this character range by matching individual bytes, which should work in older versions. You would need to convert the characters to bytes and search for the byte values. Assuming UTF-8, U+00B9 is C2 B9 and U+00BF is C2 BF. Setting LC_CTYPE to something that uses single-byte characters (like C) will ensure that it will match individual bytes even in versions that correctly support multi-byte characters.
LC_CTYPE=C grep -P -n "\xC2[\xB9-\xBF]" $str_st_location >> output_st.txt

Related

Replace matching differently encoded tags

I have a file that was translated by someone else. I don't know what encoding the person used, but seems that tags $TAG$, that were not supposed to be translated, are converted to another set of characters (i.e. even though the tags look the same, the ASCII characters they include, are not the characters from the original source file). This is messing up further substitution of the Cyrillic characters to extended ASCII-chars (which is not part of the question). So my replacement script replaces the tags (at least partially) as well.
What is the best way to replace the tags in the corrupted file by the corresponding tags from the original one?
The files have to be UTF-8 (with BOM), EOL=LF.
Mac bash preferably, thanks.
one strategy is to make a list of the current utf8 tags, a list of the ascii tags, line them up, then use paste and sed to replace the utf8 tags with ascii tags in the ukranian file:
grep -o '\$[^\$]\+\$' rights_of_man_l_ukrainian.txt | sort | uniq > utf8.tags.list
grep -o '\$[^\$]\+\$' rights_of_man_l_english.txt | sort | uniq > ascii.tags.list
# now, manually edit ascii.tags.list so that each line number has
# the correct replacement for that line of utf8.tags.list, e.g.,
# by using:
vimdiff utf8.tags.list ascii.tags.list
# escape the $s
sed -i 's/\$/\\$/g' utf8.tags.list ascii.tags.list
# now substitute the tags
paste utf8.tags.list ascii.tags.list |
while read n k; do
sed "s/$n/$k/g" rights_of_man_l_ukrainian.txt
done > rights_of_man_l_ukrainian.ascii-tags.txt
a more satisfying way is to automatically generate the utf to ascii conversion table. on mac, iconv and perl Text::Unidecode both turn the utf8 strings into garbage. on linux, konwert shows promise here.
ps: it looks like there is another problem as well, though: two missing tags:
FORCEBREAKALLIANCEDESC:1 "If they accept, both countries' opinion of us will decrease and $WITH|Y$ will get a Casus Belli on us.\nThis will also create a truce between $COUNTRY|Y$ and us, as well as lower their trust of us by $TRUSTCOST|R$. Otherwise, we will lose $PRESTIGE$ Prestige."
vs
FORCEBREAKALLIANCEDESC:1 "Якщо вони погодяться, то ставлення обох країн до нас зменшиться, а держава $WIТН|Y$ отримає привід для війни з нами.\nТакож буде оголошено перемир'я між державою $СОUNТRY|Y$ та нами, а також зменшить їхню довіру до нас. В іншому випадку, ми втратимо $РRЕSТIGЕ$ престижу."
(missing $TRUSTCOST|R$)
and
stat_game_country_desc_server:0 "$VAL|Y$% of players this month played as $NAME|Y$."
vs
stat_game_country_desc_server:0 "В середньому, в цьому місяці у гравців відбулося близько $VАL|Y$ лих."
(missing $NAME|Y$)

Turbo Grep - find special characters in UTF-8 file

I am running Windows 7 and (have to) use Turbo Grep (Borland something) to search in a file.
I have 2 version of this file, one encoded in UTF-8 and one in ANSI.
If I run the following grep on the ANSI file, I get the expected results, but I get no results with the same statement on the UTF-8 file:
grep -ni "[äöü]" myfile.txt
[-n for line numbers, -i for ignoring cases]
The Turbo Grep Version is :
Turbo GREP 5.6 Copyright (c) 1992-2010 Embarcadero Technologies, Inc.
Syntax: GREP [-rlcnvidzewoqhu] searchstring file[s] or #filelist
GREP ? for help
Help for this command lists:
Options are one or more option characters preceded by "-", and optionally
followed by "+" (turn option on), or "-" (turn it off). The default is "+".
-r+ Regular expression search -l- File names only
-c- match Count only -n- Line numbers
-v- Non-matching lines only -i- Ignore case
-d- Search subdirectories -z- Verbose
-e Next argument is searchstring -w- Word search
-o- UNIX output format Default set: [0-9A-Z_]
-q- Quiet: supress normal output
-h- Supress display of filename
-u xxx Create a copy of grep named 'xxx' with current options set as default
A regular expression is one or more occurrences of: One or more characters
optionally enclosed in quotes. The following symbols are treated specially:
^ start of line $ end of line
. any character \ quote next character
* match zero or more + match one or more
[aeiou0-9] match a, e, i, o, u, and 0 thru 9 ;
[^aeiou0-9] match anything but a, e, i, o, u, and 0 thru 9
Is there a problem with the encoding of these charactes in UTF-8? Might there be a problem with Turbo Grep and UTF-8?
Thanks in advance
Yes there are a different w7 use UTF-16 little endian not UTF-8, UTF-8 is used in unix, linux and plan 9 for cite a few OS.
Jon Skeet explain:1
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default code page for my system" which is obtained via Encoding.Default, and is often Windows-1252
UTF-8: Variable length encoding, 1-4 bytes covers every current character. ASCII values are encoded as ASCII.
UTF-16 is more similar to ANSI so for this reason with ANSI work well.
if you use only ascii both encodings are usable, but with special characters as ä ö ü etc you need use UTF-16 in windows and UTF-8 in the others

How to split a file containing non-ascii characters into words, in bash?

For example, I have a file with normal text, like:
"Word1 Kuͦn, buͤtten; word4:"
I want to get a file with 1 word per line, keeping the punctiuation, and ordered:
,
:
;
Word1
Kuͦn
buͤtten
word4
The code I use:
grep -Eo '\w+|[^\w ]' input.txt | sort -f >> output.txt
This the code works almost perfectly, except for one thing: it splits diacretical characters apart from the letters they belong to, as if they were separate words:
,
:
;
Word1
Ku
ͦ
n
bu
ͤ
tten
word4
The letters uͦ, uͤ and other with the same diacretics are not in the ASCII table. How can I split my file correctly without deleting or replacing these characters?
Edit:
locale output:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Unfortunately, U+366 (COMBINING LATIN SMALL LETTER O) is not an alphabetic character. It is a non-spacing mark, unicode category Mn, which generally maps to the Posix ctype cntrl.
Roughly speaking, an alphabetic grapheme is an alphabetic character possibly followed by one or more combining characters. It's possible to write that as a regex pattern if you have a regex library which implements Unicode general categories. Gnu grep is usually compiled with an interface to the popular pcre (Perl-compatible regular expression) library, which has reasonably good Unicode support. So if you have Gnu grep, you're in luck.
To enable "perl-like" regular expressions, you need to invoke grep with the -P option (or as pgrep). However, that is not quite enough because by default grep will use an 8-bit encoding even if the locale specifies a UTF-8 encoding. So you need to put the regex system into "UTF-8" mode in order to get it to recognize your character encoding.
Putting all that together, you might end up with something like the following:
grep -Po '(*UTF8)(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]'
-P patterns are "perl-compatible"
-o output each substring matched
(*UTF8) If the pattern starts with exactly this sequence,
pcre is put into UTF-8 mode.
\p{...} Select a character in a specified Unicode general category
\P{...} Select a character not in a specified Unicode general category
\p{L} General category L: letters
\p{N} General category N: numbers
\p{M} General category M: combining marks
\p{P} General category P: punctuation
\p{S} General category S: symbols
\p{L}\p{M}* A letter possibly followed by various combining marks
\p{L}\p{M}*|\p{N} ... or a number
More information on Unicode general categories and Unicode regular expression matching in general can be found in Unicode Technical Report 18 on regular expression matching. But beware that the syntax described in that TR is a recommendation and is not exactly implemented by most regex libraries. In particular, pcre does not support the useful notation \p{L|N} (letter or number). Instead, you need to use [\p{L}\p{N}].
Documentation about pcre is probably available on your system (man pcre); if not, have a link on me.
If you don't have Gnu grep or in the unlikely case that your version was compiled without pcre support, you might be able to use perl, python or other languages with regex capabilites. However, doing so is surprisingly difficult. After some experimentation, I found the following Perl incantation which seems to work:
perl -CIO -lne 'print $& while /(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]/g'
Here, -CIO tells Perl that input and output in UTF-8, and -nle is a standard incantation which means "automatically output new**l**ines after a print; loop through every li**n**e of the input, **e**xecuting the following in the loop".

Is Bash support Unicode 6.0?

When I use unicode 6.0 character(for example, 'beer mug') in Bash(4.3.11), it doesn't display correctly.
Just copy and paste character is okay, but if you use utf-16 hex code like
$ echo -e '\ud83c\udf7a',
output is '??????'.
What's the problem?
You can't use UTF-16 with bash and a unix(-like) terminal. Bash strings are strings of bytes, and the terminal will (if you have it configured correctly) be expecting UTF-8 sequences. In UTF-8, surrogate pairs are illegal. So if you want to show your beer mug, you need to provide the UTF-8 sequence.
Note that echo -e interprets unicode escapes in the forms \uXXXX and \UXXXXXXXX, producing the corresponding UTF-8 sequence. So you can get your beer mug (assuming your terminal font includes it) with:
echo -e '\U0001f37a'

BASH: Replacing special character groups

I have a rather tricky request...
We use a special application which is connected to a oracle database. For control reasons the application uses special characters which are defined by the application and saved in a long field of the database.
My task is to query the long field periodically and check for changes. To do that, I write the content by using a bash script in a file and compare the old and the new file with md5sum.
When there's a difference, I want to send the old file via mail. The problem is, that the old file contains these special characters and I don't know how to replace them with for example a string which describes them.
I tried to replace them on the basis of their ASCII code, but this didn't work. I've also tried to replace them by their appearance in the file. (They look like this: ^P ) This didn't work neither.
When viewing the file by text editor like nano the characters are visible like described above. But when using cat on the file, the content is only displayed until the first appearance of such a control character.
As far as I know there is know possibility to replace them while querying from the database because of the fact that the content is in a LONG field.
I hope you can help me.
Thank you in advance.
Marco
^P is the Control-P character, which is decimal 16 or hexadecimal 0x10, also known as the Data Link Escape (DLE) character in ASCII.
To replace all occurrences of 0x10 in a file with another string we can use our friend gsed:
gsed "s/\x10/Data Link Escape/g" yourfile.txt
This should replace all occurrences of characters containing the hex value 0x10 with the text string "Data Link Escape". You'll probably want to use a different string - this is just an example.
Depending on the system you're using you may be able to use the standard sed command if your version of sed recognizes the \xNN single-character escape codes. If there are multiple hex characters you need to replace you may want to create a file containing your sed commands, one for each hexadecmial character you need to replace, and tell sed or gsed to use the commands in the file - consult the sed or gsed man pages for how to do this.
Share and enjoy.
You can use xxd to change the string to its hex representation, then use xxd -r to convert back.
Or, you can use uuencode and uudecode.
One option is to run the file through cat -v. This replaces nonprinting characters with visible representations (using the ^ notation for control characters):
$ echo $'\x10\x12\x13\x14\x16' | cat -v
^P^R^S^T^V

Resources