Best way to generate random unicode strings for a file - random

I have a text file that contains many different text tags, and for each text tag I want to generate a specific number of random Unicode characters. Here's an example.
pages: [
'{"text":"ᛀ孥ງ盺阉ᰧ펄轙詙㫏鉱猈し谽损찶郩苽ꟾ低კᐵ偱⼥��溰斶퉉倰ꦑ륏쑁洤ᕿᆯ殗媠퓵뉭ꧪ텠띬䡊⊝ᚑ䖩䉧��쾒紨驪։鹓嶺伢꽫⾨閏뚲髌鰣ᯋ왘觽濔૛䙦ꨣ뫱즔슍ㄽٚឣ酭ᇤཛ稳며ᇶ菤큲뭏☗᩟䴄��䪵匧썑ᛁ唽ೂ㴾귬Ⴀ܁럒鋻㤷踂ᦔ㚚즙泱蹑轷ᨭꪃ冿讻௶ꪪ㬛樐荀꜇뒩膜걼줪퍏匫ẜ릣噪뻷婓⠫愣㒹…䮃䙼ネ辞憥㩵爤ޗ럯搯꿶ꏂ犔앙Ძ葐㉱ᩅᘰ㚞桭䤠쀢ೝ铳̴᠚㢘흱艰顱⦜Ü醲빰뿟闊리怳긵﹡殿뽨帼琧롪ؾᎣꞸ஠j"}',
'{"text":"깈끒찙蚱쓗皛ৠ綢屉撋﯅⤄㇢糮꩚ᅺ䏑⤵핫渕眝틁辶黣⚮㬤읙稓伐쏆⸚綀ᥖ㿏晲䊔๣큅᠆࡮㒲쏉䶧쁔쟚﨧씏襫瞟��荄��痮邎馝佞뀵钬ಭ綺磐饪狗촿㺵보깵ꍔ䡇⿩腾Ꟈ筰䵏䧦䔆外ᮺࡷ匞ꫩ㫠싣塞⺡截꥖ⴥ蘟籍퓉ᰧ婑锼戰魍藀ߪ查ྩꂔ䱖穁䙐땴퐁谌菑諸앚굼뾯쁊⟛軠苂뎀ঘ킲ඇ橨蜻䰐嶪ᬞ弯殦귶⇰薪鶑ニ꬛礟쒊焇㛍詙ཀ衞睤㙂됴쫴累릮쾊謡ꋱ溘ܪ握믓䇲돃쨥咽鵝閟ꙑ牊Ṑ㓁溺㱟⯳꟯뒧戜ቼ뵌༧⽆Ⲇ㯞伌ቈ㹝カף��ꎧꘜ꨺꧑��韾섒"}',
'{"text":"⤝즫㮂쀱ꪯ፣㇚鵅삄섻≖衕㉏঱⚫��鎉⁶췩্쟴��Ḋ좇鑯넫⿏㩐烃ᬟ㉳斺��ꂷ傳䷼譛ꇆ㌡慎翟瘶䨖픩虷⨨嫝갠ᱰ툈努甹șཥ↓菮滋㠼鬠訮裎အ嗠ṏ탔뎼춡蟱㣴뽳骘쬄ᘵ㢐똏鳛㤣᫖뱥䞡ࢍ⫰榞愺㍴眉伪璬瀎汢햫驛鉄食䊛ᾛ죈㨼笘ꚩ佒嬔볁Џ胫앳̘㛀��頾ᰎ孶䟌⾗些䇛홫緗ܑ踚ヽ휝磁좪隱켧ሬ脝쨘戇㽰ȯ眪蕁ꘈ艢㦪檇擟佃픍൳ߺ᱗ﶚ逄鎐뒽ƈ뢫㛇臊蒠ⷑ醑둭샤쿫ໞᏫ酨᜖ភ᠙᫮梹ࢃ؏市튮틎蒇遃绿巴釗ῆ鑹䲮꠬☐搠潚楛횩⵲絉셫ᥔ郝ٚ䍄끕螻醁㨄"}'
],
I need this to be automated as there are over 2,000 of these text tags, and I need each one filled with random Unicode characters so that the result is hard to compress. I'll greatly appreciate any help or suggestions.

Within Vim, you first need a random-number generator. The Rndm.vim plugin provides a Urndm() function that returns a uniformly distributed pseudo-random number within a given interval.
Combining that with nr2char(), we can convert the number to a Unicode character. The range 0x4E00-0x9FFF contains the CJK Unified Ideographs, for example.
For the simplicity of a one-liner, here I create a range() of ten numbers, map them to ten random code points, convert those to characters, then join and insert the result as a new line.
:put =join(map(range(1, 10), 'nr2char(Urndm(0x4e00, 0x9fff))'), '')
瑿桑輛緪蝔呔殐夶級叝
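Outside Vim, if you need to fill all 2,000+ tags in one pass, here is a minimal standalone sketch of the same idea in Python (the 200-character length is an assumption; the 2,000 tag count comes from the question; the CJK range matches the Vim one-liner):

import json
import random

# Draw each character from the CJK Unified Ideographs block (U+4E00-U+9FFF),
# the same range the Vim one-liner uses.
def random_unicode_string(length):
    return "".join(chr(random.randint(0x4E00, 0x9FFF)) for _ in range(length))

# Hypothetical driver: emit 2000 tags of 200 characters each.
pages = [json.dumps({"text": random_unicode_string(200)}, ensure_ascii=False)
         for _ in range(2000)]
print("pages: [")
print(",\n".join("    " + repr(p) for p in pages))
print("],")

Drawing uniformly from a block of roughly 21,000 code points gives about 14 bits of entropy per character, so the output compresses poorly, which matches the stated goal.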

Related

XSL-FO: wrapping long words in table cell

I'm using Docbook-XSL and Apache FOP to generate PDF documents containing tables. With the default settings, tables have fixed-width columns and lines wrap at word boundaries. But if a word is longer than the cell width, it overflows the cell. I'd like to break up the words across multiple lines in such a case. How could this be done?
Hyphenation is not a solution since the words need not be in English. (Edit: hyphenation in other languages is not a solution either. It may not be known ahead of time what language the data is in, and there may be "words" that cannot be hyphenated, such as numeric strings.)
I found suggestions to use keep-together.within-column="always" for fo:table-rows, but that didn't seem to have any effect.
(Edit:) Another suggestion was to insert zero-width spaces between all characters. But this also breaks short words mid-word. I would need a solution that breaks at word boundaries whenever possible, and mid-word only when needed.
FOP, like just about every FO processor, can hyphenate languages other than English. See http://xmlgraphics.apache.org/fop/2.1/hyphenation.html
You could try using an FO processor, such as Antenna House AH Formatter, that implements 'auto' table layout and can adjust the widths of the table columns depending on where the text can break (as well as do hyphenation for multiple languages).
Other answers for breaking text in table cells are at:
Force line break after string length
XSL-FO: Force Wrap on Table Entries
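Going back to the zero-width-space idea mentioned in the question: if preprocessing the cell text before FO generation is an option, here is a minimal sketch (an illustration, not something from the answers above) that inserts U+200B only into words longer than a threshold, so short words still wrap at word boundaries:

# Zero-width space, typically treated as a break opportunity by FO processors.
ZWSP = "\u200b"

def add_break_points(text, max_word_len=20):
    """Insert break opportunities into words longer than max_word_len;
    shorter words are left untouched (the threshold is an assumption)."""
    out = []
    for word in text.split(" "):
        if len(word) > max_word_len:
            word = ZWSP.join(word[i:i + max_word_len]
                             for i in range(0, len(word), max_word_len))
        out.append(word)
    return " ".join(out)

# Only the long token gets break points:
print(add_break_points("short words stay intact 123456789012345678901234567890"))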

ASCII Code for Uppercase/Capital R with a Tilde Character Above

I am trying to get the equivalent of LaTeX's $\tilde R$ in a Stata graph axis label. I don't think there's an SMCL way of doing that, but it's possible to use ASCII characters. However, there does not seem to be an ASCII code for an uppercase/capital R with a tilde above it.
Is there any way around that? Is it possible to combine ASCII characters somehow?
In Stata 14, this can be accomplished with:
`=ustrunescape("\u0052\u0303")'
This combines the Unicode code point for capital R (U+0052) with the combining tilde (U+0303).
MVE:
sysuse auto, clear
tw scatter price mpg, title(`=ustrunescape("\u0052\u0303")')
should produce something like this (modulo scheme):
EDIT: From Stata 14 onward, Stata supports Unicode.
ORIGINAL ANSWER for versions up to Stata 13:
The user-written program asciiplot (SSC) displays those characters available to you via char(), depending on what alphabet you are using. Your mileage may differ, but I see no such character.
Stata does not, at this writing, support LaTeX or over-striking or combinations of ASCII characters.

Decoded barcode extra digits

I am trying to come to terms with how a barcode is decoded and generated by a scanner.
A note from the client says the following generated barcode contains extra characters:
Generated Code: |2389299920014}
Extra Characters: Apparently the first two and last three characters are not part of the bar code.
Question
Are the extra characters attached by the bar code reader (therefore dependent on the scanner) or are they an intrinsic part of the barcode?
Here is a sample image of a barcode:
http://imageshack.us/a/img824/1862/dm6x.jpg
Thanks
[SOLVED] My apologies. This was just another one of those cases of 'shooting your mouth off' without doing proper research.
Solution: The code is EAN-13. The prefix and suffix are probably scanner dependent. The 13 digits in between are as follows:
First digit (from the left): check sum
Next 9 digits: Company ID + Item ID
Last 3 digits: GS1 prefix
It's hard to answer without understanding what format you are trying to encode, what the intended contents are, and what the purported contents are.
Some formats add extra information as part of the encoding process, but it does not become part of the content. When correctly encoded and decoded, the output should match the input exactly.
Barcodes encode what they encode and there is no data that is somehow part of the barcode but not somehow encoded in it.
EAN-13 has no scanner-dependent considerations, no. The encoding and decoding of a given number is the same everywhere. EAN-13 encodes 13 digits, so I am not sure what the 13 digits "in between" mean.
You mention GS1, which is something else. A family of barcodes in fact. You'd have to say what specifically you are using. The GS1 encodings are likewise not ambiguous or scanner-dependent. You know what you want to encode, you encode it exactly, it's read exactly.
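For reference, the standard EAN-13 check digit is computed over the first 12 digits with alternating weights 1 and 3, and stored as the 13th digit. A minimal sketch of that calculation (the standard algorithm, not anything specific to this scanner):

def ean13_check_digit(first12):
    """Return the EAN-13 check digit for a string of the first 12 digits."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(first12))
    return (10 - total % 10) % 10

def is_valid_ean13(code):
    """True if a 13-digit string carries a correct trailing check digit."""
    return len(code) == 13 and code.isdigit() and \
        ean13_check_digit(code[:12]) == int(code[-1])

# The digit run from the question, between the scanner's prefix and suffix:
print(is_valid_ean13("2389299920014"))  # -> True (the trailing 4 is the check digit)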

Compression algorithms for Strings

I have to generate QR codes using concatenated object properties. These strings might be long, which is why I'd like to know which compression algorithm to use, knowing that my strings' lengths are between 25 and 100+ characters.
thanks in advance,
Jerec
I am assuming that, since you are going to compress the strings before storing them, these QR codes will not be readable by a generic client; it would have to be an application that you wrote (because you are storing characters with a non-standard encoding, an ordinary client won't be able to decode them).
Instead of compressing and storing the long string in the QR code, have your application create a URI (like a GUID or a URL) and when your application decodes that URI it looks up all the values (uncompressed) that you wanted to store in the QR code. Then your app can just look up the format in any way it wants.
For example, assuming your persistent storage is an XML file, but it could be anything:
<URI = "http://mydomain.com/790C9704-8C61-435F-991D-CDBB5767AA3D">
<MyElement>14523</MyElement>
<MyElement>67548</MyElement>
...
<MyElement>46167</MyElement>
</URI>
Encoded on QR code: "http://mydomain.com/790C9704-8C61-435F-991D-CDBB5767AA3D", values can then be looked up.
The algorithm used to encode QR codes is dependent on the type of data you encode. See http://www.swetake.com/qr/qr1_en.html.
If you know, for example, that you always have the same number of digits per ID and could therefore just string them together without punctuation, you can encode them as purely numeric and you'll use 10 bits for every three characters.
If you need some kind of separator, if you use something in "0-9A-Z $%*+-./:", you'll stay alphanumeric and get 2 characters in 11 bits.
If you give it arbitrary data (note that this includes any lower case: the list above does not include lower-case letters) you're going to be using 8 bits per character.
So numeric-only would end up being roughly 60% smaller.
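To put numbers on that, here is a rough sketch of the per-mode data cost (it ignores the mode indicator and character-count header, which add a small constant on top):

ALNUM = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")

def qr_data_bits(s):
    """Approximate payload size in bits under each applicable QR mode."""
    n = len(s)
    costs = {"byte": n * 8}                                  # 8 bits per character
    if all(c in ALNUM for c in s):
        costs["alphanumeric"] = (n // 2) * 11 + (n % 2) * 6  # 2 characters per 11 bits
    if s.isdigit():
        costs["numeric"] = (n // 3) * 10 + (4 if n % 3 == 1 else 7 if n % 3 == 2 else 0)
    return costs

print(qr_data_bits("12345678901234567890"))  # numeric uses ~42% of the byte-mode bits
print(qr_data_bits("hello world 123"))       # lower case forces 8-bit byte mode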

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
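In code this is just joining and splitting on the control character; a tiny sketch (the field contents are made up):

US = "\x1f"  # ASCII 31, Unit Separator

record = US.join(["first field", "second field", "third, with a comma"])
print(record.split(US))  # -> ['first field', 'second field', 'third, with a comma']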
Assuming for some embarrassing reason you can't use CSV, I'd say go with the data. Take some sample data and do a simple character count for each value 0-127. Choose one of the characters that doesn't occur. If there is too much choice, get a bigger data set. It won't take much time to write, and you'll get the answer that's best for you.
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
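A quick sketch of that counting approach (the file name and the printable range are assumptions):

from collections import Counter

def unused_printable_chars(sample_path):
    """Return printable ASCII characters (33-126) that never occur in the
    sample data; any of them is a safe delimiter for that data set."""
    with open(sample_path, encoding="ascii", errors="ignore") as f:
        counts = Counter(f.read())
    return [chr(c) for c in range(33, 127) if chr(c) not in counts]

# e.g. candidates = unused_printable_chars("sample_data.txt")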
When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~. You could also combine two characters.
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV-style format? Characters can be escaped in a standard CSV format, and there are already a lot of parsers written.
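For example, Python's standard csv module already handles the quoting and escaping, so embedded commas and quotes survive the round trip (most languages have an equivalent library):

import csv, io

values = ["plain", "has, a comma", 'has "quotes"']

buf = io.StringIO()
csv.writer(buf).writerow(values)           # quoting/escaping handled for you
encoded = buf.getvalue()

decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded == values)                   # -> True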
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use something like this:
Say you want to concatenate str1, str2 and str3. What I do is:
delimitedStr = str1.Replace("#", "#a").Replace("|", "#p")
             + "|" + str2.Replace("#", "#a").Replace("|", "#p")
             + "|" + str3.Replace("#", "#a").Replace("|", "#p");
Then to retrieve the originals, use:
splitStr = delimitedStr.Split("|".ToCharArray());
str1 = splitStr[0].Replace("#p", "|").Replace("#a", "#");
str2 = splitStr[1].Replace("#p", "|").Replace("#a", "#");
str3 = splitStr[2].Replace("#p", "|").Replace("#a", "#");
Note: the order of the Replace calls is important.
It's unbreakable and easy to implement.
Pipe for the win! |
We use ASCII 0x7F (DEL), which is pseudo-printable and hardly ever comes up in regular usage.
Well, it's going to depend on the nature of your text to some extent, but a vertical bar (0x7C) doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be sure that the delimiter you use will not cause a conflict, loop over the file checking for the delimiter you want, and if it exists, double the delimiter string until the file no longer contains a match. It doesn't matter if there are similar strings, because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64-encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 character set.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside them without breaking the structure.
CSV is probably a better idea for most situations, though.
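A short sketch of the Base64 approach (the comma is just one choice of delimiter outside the Base64 alphabet):

import base64

def pack(items):
    """Base64-encode each item so any delimiter outside the Base64 alphabet
    (A-Z, a-z, 0-9, +, /, =) is safe; ',' is used here."""
    return ",".join(base64.b64encode(s.encode("utf-8")).decode("ascii")
                    for s in items)

def unpack(packed):
    return [base64.b64decode(tok).decode("utf-8") for tok in packed.split(",")]

original = ["plain text", "pipes | and, commas", "even\nnewlines"]
print(unpack(pack(original)) == original)  # -> True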
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non-printable character works if you're not hand-creating or modifying the file. For quick random-access file storage and retrieval, fixed field widths are used instead: you don't even have to read the whole file, you're literally pulling from the file by offset. This is how databases do some of their storage, though they also manage the spaces between records and such, and it introduces the problem of a maximum data element width. (In the old days an index attached a header defining the width and data type of each element; later, compression with character remapping was introduced, which lets a text file shrink to about 1/8 of its size in transmission. Variable-length character encoding for the win.)
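A sketch of that fixed-width idea: if every record is padded to the same width, a record can be fetched by offset without scanning the file (the width and file name here are invented for illustration):

RECORD_WIDTH = 32  # assumed fixed record width, including the trailing newline

def write_records(path, items):
    """Pad every record to RECORD_WIDTH bytes so offsets are predictable."""
    with open(path, "wb") as f:
        for s in items:
            f.write(s.encode("ascii")[:RECORD_WIDTH - 1].ljust(RECORD_WIDTH - 1) + b"\n")

def read_record(path, index):
    """Fetch record `index` directly by offset; no delimiters needed."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_WIDTH)
        return f.read(RECORD_WIDTH).rstrip(b" \n").decode("ascii")

write_records("records.dat", ["alpha", "beta", "gamma"])
print(read_record("records.dat", 1))  # -> 'beta'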
Make it dynamic :)
Announce your control characters in the file header, for example:
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
I have implemented something similar: a "plaintar" text container format, to escape and wrap UTF-16 text in ASCII, as an alternative to MIME multipart messages.
See https://github.com/milahu/live-diff-html-editor
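For completeness, a rough decoder sketch for the self-describing format shown above (my own reading of the example, not the linked project; it assumes the first four lines are the header and the rest is one wrapped record):

def decode_container(text):
    """Parse a container whose header announces its own control characters."""
    lines = text.splitlines()
    header = dict(line.split(": ", 1) for line in lines[:4])
    delim, esc = header["delimiter"], header["escape"]
    wrap, width = header["wrapline"], int(header["width"])

    # 1. Unwrap: a physical line of exactly `width` characters ending in the
    #    wrapline character continues on the next physical line.
    pieces = []
    for line in lines[4:]:
        if len(line) == width and line.endswith(wrap):
            pieces.append(line[:-1])
        else:
            pieces.append(line)
    joined = "".join(pieces)

    # 2. Split on the delimiter, honouring the escape character.
    fields, current, i = [], [], 0
    while i < len(joined):
        ch = joined[i]
        if ch == esc and i + 1 < len(joined):
            current.append(joined[i + 1])   # escaped character taken literally
            i += 2
        elif ch == delim:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields

# Applied to the example file above, this returns the four strings listed,
# e.g. ['hello world', 'this is \\just\\ a sample text', '$someVar$', ...].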
