How to enumerate unique characters in a UTF-8 document? With sed?

I'm converting some Polish<->English dictionaries from RTF to HTML. The Polish special characters are coming out fine. But IPA (International Phonetic Alphabet) glyphs get changed to funny things, depending on what program I use for conversion. For example, /ˈbiːrɪ/ comes out as /ÈbiùrI/ or /∪βιρΙ/.
I'd like to correct these documents with a search & replace, but I want to make sure I don't miss any characters, and I don't want to manually pore over dictionary entries. I'd like to output a list of all unique non-ASCII characters in a document.
I found this thread:
Find Unique Characters in a File
... and I tried the following two proposals:
sed -e "s/./\0\n/g" inputfile | sort -u
sed -e "s/(.)/\1\n/g" inputfile | sort -u
They both work nicely, and seem to both generate the same output. My problem is that they only output standard ASCII characters, and what I'm looking for is exactly the opposite.
The sed tool looks awesome, but I don't have time to learn it right now (though I intend to later). I'm hoping the solution will be clear to someone who's already mastered this tool, and they can save me a lot of time. [-:
Thanks in advance!

This is not a sed solution but a Python solution. It reads the contents of a file, decodes it as UTF-8, turns it into a set (thus throwing away duplicates), throws away the ASCII characters (0-127), sorts what is left, and then joins it back together with a newline between each character:
'\n'.join(sorted(set(unicode(open(inputfile).read(), 'utf-8')) - set(chr(i) for i in xrange(128))))
As something you'd run from the command line if you felt so inclined,
python -c "print '\n'.join(sorted(set(unicode(open('inputfile').read(), 'utf-8')) - set(chr(i) for i in xrange(128))))"
(You could also use ''.join instead of '\n'.join which would list the characters without a newline in between.)
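(That one-liner is Python 2. A rough Python 3 equivalent, assuming the file really is UTF-8:)
with open('inputfile', encoding='utf-8') as f:
    text = f.read()
print('\n'.join(sorted(set(text) - {chr(i) for i in range(128)})))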

Related

remove set of characters surrounding value

I'm redirecting output of an API call to file
however I always get the following characters surrounding the value I need
domainid='^[[39;49;00m^[[33;01m75307d12-e3f4-4a96-ac23-e2a9439f8299^[[39;49;00m'
Desired output
domainid='75307d12-e3f4-4a96-ac23-e2a9439f8299'
I really have no idea how to clean the output and make it look like the above.
Any suggestions will be highly appreciated.
Thank you
Those are ANSI control characters, or escape sequences, and they typically are used to add colors, underline, and so forth to your output.
First order of business is to check if your API command line tool supports a no-color mode. That would solve your problem at the source.
Barring that, try this Server Fault answer, which has a command to clear ANSI sequences out of a text file using sed.
You could remove the undesired characters by replacing the line with just the submatches you want to keep:
... | sed -r "s/(domainid=).*([0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}).*/\1'\2'/i"
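If the sequences vary too much for a fixed pattern like that, here is a minimal Python sketch of the same cleanup; the regex only covers the color-style CSI sequences shown in the question:
import re

raw = "domainid='\x1b[39;49;00m\x1b[33;01m75307d12-e3f4-4a96-ac23-e2a9439f8299\x1b[39;49;00m'"
clean = re.sub(r'\x1b\[[0-9;]*m', '', raw)
print(clean)  # domainid='75307d12-e3f4-4a96-ac23-e2a9439f8299'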

Windows SED command - simple search and replace without regex

How should I use the 'sed' command to find and replace a given word/words/sentence without treating any of them as special characters?
In other words, how do I make sed treat the find and replace parameters as plain text?
In the following example, where I want to replace 'sagar' with '+sagar', I have to give the following command:
sed "s/sagar/\\+sagar/g"
I know that \ should be escaped with another \, but I can't do this manipulation by hand, as there are so many special characters and combinations of them.
I am going to take the find and replace parameters as input from a user screen, and I want to execute sed from C# code.
Simply put, I do not want sed's regular expressions to be used; I want my command text to be treated as plain text.
Is this possible?
If so, how can I do it?
While there may be sed versions that have an option like --noregex_matching, most of them don't. Because you're getting the search and replace input by prompting a user, your best bet is to scan the user input strings for regexp special characters and escape them as appropriate.
Also, will your users expect, for example, their all-caps search input to correctly match and replace a lower- or mixed-case string? In that case, recall that you could rewrite their target string as [Ss][Aa][Gg][Aa][Rr], and replace with +Sagar.
Note that there are far fewer regex characters used on the replacement side, with '&' meaning "the complete string that was matched", and then the numbered replacement groups, like \1, \2, .... Given users that have no knowledge or expectation that they can use such characters, the likelihood of them using \1 in their substitution is pretty low. More likely they may have a valid use for &, so you'll have to scan (at least) for that and replace it with \&. In a basic sed, that's about it. (There may be others in the latest GNU seds, or in some of the seds that have their genesis as PC tools.)
For a replacement string, you shouldn't have to escape the + char at all. Probably yes for \. Again, you can scan your user's "naive" input and add escape chars as needed.
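To illustrate that scanning approach, here is a sketch in Python (rather than C#, but the idea carries over); the character lists reflect the basic-sed assumptions discussed above, so adjust them for your actual sed:
import re

def escape_search(s):
    # Escape the characters that are special in a basic regular expression,
    # plus the / delimiter.
    return re.sub(r'([][\\/.*^$])', r'\\\1', s)

def escape_replacement(s):
    # On the replacement side only \, & and the delimiter need escaping.
    return re.sub(r'([\\/&])', r'\\\1', s)

find, repl = 'sagar', '+sagar'
print(f"sed 's/{escape_search(find)}/{escape_replacement(repl)}/g'")
# prints: sed 's/sagar/+sagar/g'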
Finally, if you're doing this for a package that will be distributed and you'll be relying on the users' version of sed, beware that there are many versions of sed floating around, some that have their roots in Unix/Linux, and others, particularly super-sed, that (I'm pretty sure) got started as PC standalones and have a very different feature set.
IHTH.

Trying to use egrep to pull only matched hexadecimal chars from a database

The hex chars are in the form of <9F>. There's lots of data around them, but I am only interested in getting the regex right so I can remove these from a table with sed and then re-import the database.
Here's what I've tried so far:
egrep -io '\<[0-9A-Fa-f]{2}\>' /root/database.sql
And several variations. I can't seem to get it to work, and I'm not sure why. I've tried double/triple escaping and grouping as well. It finds the character pair, but doesn't find the < >'s
So it gives output like this:
39
48
69
6F
B9
but I need it to get the full
<39>
<48>
<69>
<6F>
<B9>
I'm betting that these are bytes outside the range your terminal can display (either control characters or multibyte strings).
First I have to ask if you are absolutely certain that you want to clean this out of the database (after all, someone might have really wanted to insert a record with the word 'über' or something). Also, I have seen some databases use control characters as separators in DB dump files as well (Ingres comes to mind).
If you still want to scrub the data then maybe try this:
perl -pe 'tr/\001-\177//cd'
I'm typing this on a tablet so I can't double check, but that should delete any characters outside the 7-bit ASCII range.
This works for me:
$ cat testfile
<39> <48>
<60><6F><a4>
$ egrep -io '<[0-9A-Fa-f]{2}>' testfile
<39>
<48>
<60>
<6F>
<a4>
Escaping the less than and greater than signs makes the match look for beginning of word and end of word boundaries. If you want to match those literal characters then use them unescaped.
Also, you can simplify your regex by using a character class.
grep -E '<[[:xdigit:]]{2}>'
It's possible that you have other non-printing characters that aren't being matched. As Adam suggested, use od or hexdump to see what else may be hiding. Binary files are sometimes represented for output using, e.g., <EF> without those characters actually being there. What is there would be a byte having the value 0xEF. Binary data like that won't work well with sed or grep.
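If you'd rather not eyeball od output, a quick Python sketch (file path taken from the question) can report which high bytes actually occur in the dump:
from collections import Counter

with open('/root/database.sql', 'rb') as f:
    data = f.read()

high = Counter(b for b in data if b >= 0x80)
for byte, count in sorted(high.items()):
    print(f'<{byte:02X}> occurs {count} times')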

LANG and sed on OSX

In a recent question it was noted that on OSX, running sed on a non-ASCII file gave strange results. For instance, if you do (/usr/bin/cal is a random binary file)
sed 's/[^A-Z]//' /usr/bin/cal
sed will remove all of the printable characters other than A-Z, but many nonprintable characters remain. If however, you do
LANG='' sed 's/[^A-Z]//' /usr/bin/cal
only A-Z (and newlines) are output. Why?
Normally LANG=en_US.UTF-8. What is going on? I cannot see any way that the output of sed could be considered correct in UTF-8. Is it broken, or is there some notion of working that I do not understand?
I know that the OSX sed is conforming to POSIX, and is therefore different from the beloved GNU sed.
Binary data, such as the contents of /usr/bin/cal, is not UTF-8, and so will confuse any code that reads it as if it were. In particular, any byte with the high bit set (i.e., >= 128) will be interpreted as part of a multi-byte sequence representing a single character, and will thus be elided from the output. Not all sequences of bytes with the high bit set are valid UTF-8, so things will get quite confused, but this probably explains why some non-printable characters remain while (possibly) others do not.
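You can see the effect by decoding the same bytes both ways in Python:
data = bytes([0x41, 0xC3, 0xA9, 0x42])  # 'A', the UTF-8 encoding of 'é', 'B'
print(data.decode('latin-1'))  # AÃ©B - every byte is its own character
print(data.decode('utf-8'))    # AéB  - the two high bytes form one character
print(bytes([0x41, 0xC3, 0x42]).decode('utf-8', 'replace'))  # invalid sequence -> A�B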
In short: if you want to use text-oriented tools on binary data, don't.

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a small "US" in a single glyph), but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
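For what it's worth, using it is a one-liner in most languages; a Python sketch with made-up field values:
US = '\x1f'  # ASCII 31, Unit Separator
record = US.join(['first field', 'second field', 'third field'])
print(record.split(US))  # ['first field', 'second field', 'third field']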
Assuming for some embarrassing reason you can't use CSV, I'd say go with the data. Take some sample data and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there are too many unused candidates to choose from, get a bigger data set. It won't take much time to write (there's a sketch below), and you'll get the answer that's best for you.
The answer will be different for different problem domains: | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice, but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
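The character count is only a few lines; a Python sketch, with the sample file name assumed:
from collections import Counter

with open('sample.txt', encoding='utf-8') as f:
    counts = Counter(f.read())

# Printable ASCII characters (excluding space) that never occur in the sample.
unused = [chr(i) for i in range(33, 127) if counts[chr(i)] == 0]
print('candidate delimiters:', unused)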
When working across different languages, this symbol: ¬
proved to be the best. However, I'm still testing.
Probably | or ^ or ~. You could also combine two characters.
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough, the ASCII table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use something like this:
say you want to concatenate str1, str2 and str3.
What I do is:
delimitedStr = str1.Replace("#", "#a").Replace("|", "#p") + "|"
             + str2.Replace("#", "#a").Replace("|", "#p") + "|"
             + str3.Replace("#", "#a").Replace("|", "#p");
then to retrieve the originals, use:
splitStr = delimitedStr.Split("|".ToCharArray());
str1 = splitStr[0].Replace("#p", "|").Replace("#a", "#");
str2 = splitStr[1].Replace("#p", "|").Replace("#a", "#");
str3 = splitStr[2].Replace("#p", "|").Replace("#a", "#");
Note: the order of the Replace calls is important.
It's unbreakable and easy to implement.
Pipe for the win! |
We use ASCII 0x7F (DEL), which is pseudo-printable and hardly ever comes up in regular usage.
Well, it's going to depend on the nature of your text to some extent, but a vertical bar (0x7C) doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then loop over the file checking for the delimiter you want, and if it exists, double the string until the file no longer has a match. It doesn't matter if there are similar strings, because your program will only look for exact delimiter matches.
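That doubling loop is tiny in any language; a Python sketch (function name invented for illustration):
def pick_delimiter(text, base='&,'):
    # Double the candidate until it no longer occurs anywhere in the text.
    delim = base
    while delim in text:
        delim += delim
    return delim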
This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64-encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 character set.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside them without breaking the structure.
CSV is probably a better idea for most situations, though.
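A sketch of the Base64 idea in Python; the standard Base64 alphabet is only A-Za-z0-9+/=, so a pipe can never collide with encoded content:
import base64

items = ['plain text', 'text with | pipes', 'more text']
packed = '|'.join(base64.b64encode(s.encode('utf-8')).decode('ascii') for s in items)
unpacked = [base64.b64decode(s).decode('utf-8') for s in packed.split('|')]
print(unpacked == items)  # True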
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non-printable character works if you're not hand-creating or modifying the file. For quick random-access file storage and retrieval, fixed field widths are used instead: you don't even have to read the file sequentially, you pull from it by offset. This is how databases do some of their storage, though they also have to manage the space between records, and it introduces the problem of a maximum data element width. (In the old days an index attached a header that defined the width and data type of each element; later, compression with character remapping was introduced, which lets a text file shrink to about 1/8 of its size in transmission. Variable-length character encoding for the win.)
make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
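A Python sketch that decodes the wrapped example above, using the header values it announces:
delimiter, escape, wrapline, width = '~', '\\', '$', 19

lines = [
    'hello world~this i$',
    's \\\\just\\\\ a sampl$',
    'e text~$someVar$~h$',
    'ere is some \\~\\~ma$',
    'rkdown strikethrou$',
    'gh\\~\\~ text',
]

# Undo the wrapping: a wrapline char in the last column continues the line.
joined = ''
for line in lines:
    if len(line) == width and line.endswith(wrapline):
        joined += line[:-1]
    else:
        joined += line

# Split on unescaped delimiters, then drop the escape characters.
fields, current, i = [], '', 0
while i < len(joined):
    if joined[i] == escape and i + 1 < len(joined):
        current += joined[i + 1]  # keep the escaped character literally
        i += 2
    elif joined[i] == delimiter:
        fields.append(current)
        current = ''
        i += 1
    else:
        current += joined[i]
        i += 1
fields.append(current)

for field in fields:
    print(field)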
I have implemented something similar: a plaintar text container format, to escape and wrap UTF-16 text in ASCII, as an alternative to MIME multipart messages. See https://github.com/milahu/live-diff-html-editor
