Ruby pack and Latin (high-ASCII) characters

An action outputs a fixed-length string via Ruby's pack function
clean = [edc_unico, sequenza_sede, cliente_id.to_s, nome, indirizzo, cap, comune, provincia, persona, note, telefono, email]
string = clean.pack('A15A5A6A40A35A5A30A2A40A40A18A25')
However, the data is in UTF-8 so as to allow Latin (high-ASCII) characters. The result of the pack action is logical: each high-ASCII character takes the space of 2 regular ASCII characters, so the resulting string is shortened by 1 space character for every such character, defeating the original purpose.
What would be a concise Ruby command to detect high-ASCII characters and add an extra space at the end of each variable for each one, so that the length is brought back to its proper target? (Note: I am assuming there is no directive that addresses this specifically, and the whole lot of pack directives is mind-muddling.)
Update: an example where the second line shifts positions because of accented characters:
CNFrigo 539 Via Privata Da Via Iseo 6C 20098San Giuliano Milanese MI02 98282410 02 98287686 12886480156 12886480156 Bo3 Euro Giuseppe Frigo Transport 349 2803433 M.Gianoli#Delanchy.Fr S.Galliard#Delanchy.Fr
CNIn's M 497 Via Istituto S.Maria della Pietà, 30173Venezia Ve041 8690111 340 6311408 0041 5136113 00115180283 02896940273 B60Fm Euro Per Documentazioni Tecniche Inviare Materiale A : Silvia_Scarpa#Insmercato.It Amministrazione : Michela_Bianco#Insmercato.It Silvia Scarpa Per Liberatorie 041/5136171 Sig.Ra Bianco Per Pagamento Fatture 041/5136111 (Solo Il Giovedi Pomeriggio Dalle 14 All Beniservizi.Insmercato#Pec.Gruppopam.It

It looks like you are trying to use pack to format strings to fixed width columns for display. That’s not what it’s for; it is generally used for packing data into fixed byte structures for things like network protocols.
You probably want to use a format string instead, which is better suited for manipulating data for display.
Have a look at String#% (i.e. the % method on string). Like pack it uses another little language which is defined in Kernel#sprintf.
Taking a simplified example, with the two arrays:
plain = ["Iseo", "Next field"]
accent = ["Pietà", "Next field"]
then using pack like this:
puts plain.pack("A10A10")
puts accent.pack("A10A10")
will produce a result that looks like this, where “Next field” isn’t aligned since pack is dealing with the width in bytes, not the displayed width:
Iseo      Next field
Pietà    Next field
Using a format string, like this:
puts "%-10s%-10s" % plain
puts "%-10s%-10s" % accent
produces the desired result, since it is dealing with the displayable width:
Iseo      Next field
Pietà     Next field
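One caveat worth checking: pack's A directive also truncates over-long values, whereas a plain %-10s only pads. The following is a minimal sketch of one way to pad and truncate by character count; fixed_width is a hypothetical helper name, not a Ruby built-in, and the width of 10 is just the value from the simplified example.

```ruby
# Hypothetical helper (not a Ruby built-in): pad or truncate a value to an
# exact width measured in characters rather than bytes.
def fixed_width(value, width)
  value.to_s[0, width].ljust(width)
end

fields = ["Pietà", "Next field"]

packed    = fields.pack("A10A10")   # widths are counted in bytes
formatted = "%-10s%-10s" % fields   # widths are counted in characters
manual    = fields.map { |f| fixed_width(f, 10) }.join

puts packed.bytesize      # byte length of the packed record
puts formatted.length     # character length of the formatted record
puts manual == formatted  # both character-based approaches agree here
```

If the record must have a fixed length in bytes rather than in characters, the padding would have to be computed from bytesize instead; which of the two is needed depends on the consumer of the file.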

Related

How to Generate Random String using Laravel Faker?

Is there any way or method to generate a fake string using Laravel Faker?
For example, in Laravel we can generate a random string of 20 characters with:
str_random(20);

Faker offers a couple of methods that let you replace placeholders in a given string with random characters:
lexify - takes given string and replaces ? with random letters
asciify - takes given string and replaces * with random ascii characters
numerify - takes given string and replaces # with random digits
bothify - combines the lexify and numerify
You could try to use one of them, depending on the requirements you have for that random string you need. asciify uses the largest set of characters as replacement so using that one makes most sense.
The following will give you a random string of 20 ascii characters:
$faker->asciify('********************')

An alternative that generates a string without special characters:
$faker->regexify('[A-Za-z0-9]{20}')

$faker->text($maxNbChars = 50);

$faker->text()
// generates up to 200 characters by default, e.g. "Aut quo omnis placeat eos omnis eos."
$faker->text(10);
// generates at most 10 characters, e.g. "Labore."
All texts seem to be one or more Latin pseudo-sentences separated by spaces, each ending with a dot.

Use realText from Faker\Provider\en_US\Text:
$faker->realText($maxNbChars = 200, $indexSize = 2) // "And yet I wish you could manage it?) 'And what are they made of?' Alice asked in a shrill, passionate voice. 'Would YOU like cats if you were never even spoke to Time!' 'Perhaps not,' Alice replied."

How do I convert hex to binary (and vice versa) in Ruby, WHILE maintaining leading zeroes?

I have a data structure that I'd like to convert back and forth from hex to binary in Ruby. The simplest approach for a binary to hex is '0010'.to_i(2).to_s(16) - unfortunately this does not preserve leading zeroes (due to the to_i call), as one may need with data structures like cryptographic keys (which also vary with the number of leading zeroes).
Is there an easy built in way to do this?

I think you should have a firm idea of how many bits are in your cryptographic key. That should be stored in some constant or variable in your program, not inside individual strings representing the key:
KEY_BITS = 16
The most natural way to represent a key is as an integer, so if you receive a key in a hex format you can convert it like this (leading zeros in the string do not matter):
key = 'a0a0'.to_i(16)
If you receive a key in a (ASCII) binary format, you can convert it like this (leading zeros in the string do not matter):
key = '101011'.to_i(2)
If you need to output a key in hex with the right number of leading zeros:
key.to_s(16).rjust((KEY_BITS+3)/4, '0')
If you need to output a key in binary with the right number of leading zeros:
key.to_s(2).rjust(KEY_BITS, '0')
If you really do want to figure out how many bits might be in a key based on a (ASCII) binary or hex string, you can do:
key_bits = binary_str.length
key_bits = hex_str.length * 4
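Putting these pieces together, a small sketch of the round trip might look like this. The helper names key_to_hex and key_to_bin are made up for illustration, and the 16-bit key size is simply the value assumed above.

```ruby
# Hypothetical helpers: render an integer key with a fixed number of digits.
def key_to_hex(key, bits)
  key.to_s(16).rjust((bits + 3) / 4, '0')
end

def key_to_bin(key, bits)
  key.to_s(2).rjust(bits, '0')
end

key_bits = 16              # assumed key size for this example
key = '00a0'.to_i(16)      # the leading zeros are dropped here

puts key                          # the plain integer value
puts key_to_hex(key, key_bits)    # zero-padded to 4 hex digits
puts key_to_bin(key, key_bits)    # zero-padded to 16 binary digits
```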

The truth is, leading zeros are not part of the integer value. I mean, it's a little detail related to representation of this value, not the value itself. So if you want to preserve properties of representation, it may be best not to get to underlying values at all.
Luckily, hex<->binary conversion has one neat property: each hexadecimal digit corresponds to exactly 4 binary digits. So, assuming you only get binary numbers whose number of digits is divisible by 4, you can just construct two dictionaries for converting back and forth:
# Hexadecimal part is easy
hex = [*'0'..'9', *'A'..'F']
# Binary... not much longer, but a bit trickier
bin = (0..15).map { |i| '%04b' % i }
Note the use of String#% operator, that formats the given value interpreting the string as printf-style format string.
Okay, so these are lists of "digits", 16 each. Now for the dictionaries:
hex2bin = hex.zip(bin).to_h
bin2hex = bin.zip(hex).to_h
Converting hex to bin with these is straightforward:
"DEADBEEF".each_char.map { |d| hex2bin[d] }.join
Converting back is not that trivial. I assume we have a "good number" that can be split into groups of 4 binary digits each. I haven't found a cleaner way than using String#scan with a "match every 4 characters" regex:
"10111110".scan(/.{4}/).map { |d| bin2hex[d] }.join
The procedure is mostly similar.
Bonus task: implement the same conversion without my assumption of having only "good binary numbers", e.g. for "110101".
"I-should-have-read-the-docs" remark: there is Hash#invert that returns a hash with all key-value pairs inverted.
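One possible take on the bonus task, as a sketch only: it assumes that left-padding the binary string with zeros up to a multiple of 4 digits is acceptable, which is a design choice rather than the only answer. The lambda names are made up for illustration.

```ruby
hex_digits = [*'0'..'9', *'A'..'F']
bin_digits = (0..15).map { |i| '%04b' % i }

hex2bin = hex_digits.zip(bin_digits).to_h
bin2hex = hex2bin.invert   # same mapping as bin_digits.zip(hex_digits).to_h

# Left-pad to a multiple of 4 bits so that every group maps to one hex digit.
bin_to_hex = lambda do |bin|
  padded = bin.rjust((bin.length + 3) / 4 * 4, '0')
  padded.scan(/.{4}/).map { |group| bin2hex.fetch(group) }.join
end

# fetch raises KeyError on anything that is not a hex digit.
hex_to_bin = ->(hex) { hex.upcase.each_char.map { |d| hex2bin.fetch(d) }.join }

puts bin_to_hex.call('110101')
puts hex_to_bin.call('00ff')
```

Note that the padding changes the number of binary digits, so converting the result back gives "00110101" rather than the original "110101".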

This is the most straightforward solution I found that preserves leading zeros. To convert from hexadecimal to binary:
['DEADBEEF'].pack('H*').unpack('B*').first # => "11011110101011011011111011101111"
And from binary to hexadecimal:
['11011110101011011011111011101111'].pack('B*').unpack1('H*') # => "deadbeef"
Here you can find more information:
Array#pack: https://ruby-doc.org/core-2.7.1/Array.html#method-i-pack
String#unpack1 (similar to unpack): https://ruby-doc.org/core-2.7.1/String.html#method-i-unpack1
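A point to keep in mind with this approach is that pack works in whole bytes, so an input with an odd number of hex digits ends up padded with zero bits on the right. A short sketch to check the behaviour (the lambda names are only for illustration, and unpack1 needs Ruby 2.4 or later):

```ruby
hex_to_bin = ->(hex) { [hex].pack('H*').unpack1('B*') }
bin_to_hex = ->(bin) { [bin].pack('B*').unpack1('H*') }

puts hex_to_bin.call('00ff')               # leading zeros survive
puts bin_to_hex.call('0000000011111111')   # and survive the way back
puts hex_to_bin.call('abc')                # odd digit count: padded to 2 bytes
```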

Convert escaped unicode (\u008E) to accented character (Ž) in Ruby?

I am having a very difficult time with this:
# contained within:
"MA\u008EEIKIAI"
# should be
"MAŽEIKIAI"
# nature of string
$ p string3
"MA\u008EEIKIAI"
$ puts string3
MAEIKIAI
$ string3.inspect
"\"MA\\u008EEIKIAI\""
$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes>
Any ideas on where to start?
Note: this is not a duplicate of my previous question.

\u008E means that the Unicode character with the codepoint 8e (in hex) appears at that point in the string. This character is the control character “SINGLE SHIFT TWO” (see the Unicode code chart). The character Ž is at the codepoint U+017D. However, it is at position 8e in the Windows CP-1252 encoding. Somehow you’ve got your encodings mixed up.
The easiest way to “fix” this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.
Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. One way to convert the string would be something like this:
string3.force_encoding('BINARY') # treat the string just as bytes for now
string3.gsub!(/\xC2/n, '') # remove the C2 byte
string3.force_encoding('CP1252') # give the string the correct encoding
string3.encode!('UTF-8') # convert in place to the desired encoding
Note that this isn’t a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way, will be amenable to conversion like this. Some will be two bytes c2 xx where xx is the correct byte (like in this case); others will be c3 yy where yy is a different byte.
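For strings where every character is at or below U+00FF, a variation that covers both the c2 xx and c3 yy cases is to go through ISO-8859-1, which maps each of those code points to the single byte of the same value. This is a sketch under that assumption; repair_cp1252 is a hypothetical helper name, and encode will raise for characters above U+00FF or for byte values that CP-1252 leaves undefined.

```ruby
# Hypothetical helper: undo "CP-1252 bytes stored as Unicode code points".
def repair_cp1252(str)
  str.encode('ISO-8859-1')             # code points U+0000..U+00FF become single bytes
     .force_encoding('Windows-1252')   # reinterpret those bytes as CP-1252
     .encode('UTF-8')                  # convert to the desired encoding
end

string3 = "MA\u008EEIKIAI"
puts repair_cp1252(string3)
```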

What about using Regexp & Array#pack to convert the Unicode escape?
str = "MA\\u008EEIKIAI"
puts str #=> MA\u008EEIKIAI
str.gsub!(/\\u(.{4})/) do |match|
[$1.to_i(16)].pack('U')
end
puts str #=> MA EIKIAI

Ruby: Fuzzing through all unicode characters (UTF8/Encoding/String Manipulation)

I can't iterate over the entire range of unicode characters.
I searched everywhere...
I am building a fuzzer and want to embed into a url, all unicode characters (one at a time).
For example:
http://www.example.com?a=\uff1c
I know that there are some built tools but I need more flexibility.
If I could do something like the following: "\u" + "ff1c" it would be great.
This is the closest I got:
char = "\u0000"
...
#within iteration
char.succ!
...
but after the character "\u0039", which is the number 9, I will get "10" instead of ":"

You could use pack to convert numbers to UTF8 characters but I'm not sure if this solves your problem.
You can either create an array with the numeric values of all the characters and use pack to get a UTF-8 string, or you can just loop from 0 to whatever you need and use pack within the loop.
I've written a small example to explain myself. The code below prints out the hex value of each character followed by the character itself.
0.upto(100) do |i|
puts "%04x" % i + ": " + [i].pack("U*")
end
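Applied to the URL from the question, the loop could be sketched as follows. The helper name each_fuzz_url is made up for illustration, the parameter name a comes from the example URL, and percent-encoding the character is an assumption about what the target expects. Surrogate code points (U+D800 to U+DFFF) are skipped because they are not valid characters in UTF-8.

```ruby
require 'uri'

# Hypothetical helper: yields one URL per code point in the given range.
def each_fuzz_url(base, codepoints)
  return enum_for(:each_fuzz_url, base, codepoints) unless block_given?

  codepoints.each do |cp|
    next if (0xD800..0xDFFF).cover?(cp)   # skip surrogates
    char = [cp].pack('U')                 # integer -> UTF-8 character
    yield "#{base}?a=#{URI.encode_www_form_component(char)}"
  end
end

each_fuzz_url('http://www.example.com', 0xFF1A..0xFF1C) { |url| puts url }
```

Passing 0..0x10FFFF as the range would walk the whole Unicode code space.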

Here's some simpler code, albeit slightly obfuscated, that takes advantage of the fact that Ruby will convert an integer on the right-hand side of the << operator to a codepoint. In Ruby 1.8 this only works for integer values <= 255; in 1.9 it also works for values greater than 255.
0.upto(100) do |i|
puts "" << i
end

How to remove these kind of symbols (junk) from string?

Imagine I have String in C#: "I DonÃ¢â‚¬â„¢t see ya.."
I want to remove (replace to nothing or etc.) these "Ã¢â‚¬â„¢" symbols.
How do I do this?

That 'junk' looks a lot like someone interpreted UTF-8 data as ISO 8859-1 or Windows-1252, probably repeatedly.
Read as Windows-1252 bytes, Ã¢â‚¬â„¢ is the sequence C3 A2, E2 82 AC, E2 84 A2. Decoding those bytes as UTF-8 gives:
UTF-8 C3 A2 = U+00E2 = â
UTF-8 E2 82 AC = U+20AC = €
UTF-8 E2 84 A2 = U+2122 = ™
We then do it again: in Windows-1252 the characters â€™ are the bytes E2 80 99, which is the UTF-8 encoding of U+2019, so the character should have been RIGHT SINGLE QUOTATION MARK (’).
You could make multiple passes with byte arrays, Encoding.UTF8 and Encoding.GetEncoding(1252) to correctly turn the junk back into what was originally entered. You will need to check your processing to find the two places that UTF-8 data was incorrectly interpreted as Windows-1252.
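The multiple-pass idea can be sketched as follows, in Ruby rather than C# to match the other examples on this page. It assumes the damage really was UTF-8 read as Windows-1252; undo_mojibake is a hypothetical helper name, and the string is written with \u escapes only to keep the source unambiguous.

```ruby
# Hypothetical helper: undo one round of "UTF-8 bytes read as Windows-1252".
# Returns the input unchanged if the round trip does not yield valid UTF-8.
def undo_mojibake(str)
  repaired = str.encode('Windows-1252').force_encoding('UTF-8')
  repaired.valid_encoding? ? repaired : str
rescue Encoding::UndefinedConversionError
  str
end

# "I DonÃ¢â‚¬â„¢t see ya.." written with escapes
junk  = "I Don\u00C3\u00A2\u00E2\u201A\u00AC\u00E2\u201E\u00A2t see ya.."
once  = undo_mojibake(junk)   # first pass undoes the outer misreading
twice = undo_mojibake(once)   # second pass recovers the apostrophe
puts twice
```

A further pass leaves the repaired string alone, because the byte for ’ in Windows-1252 is not valid UTF-8 on its own.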

"I DonÃ¢â‚¬â„¢t see ya..".Replace( "Ã¢â‚¬â„¢", string.Empty);
How did that junk get in there the first place? That's the real question.

By removing any non-latin character you'll be intentionally breaking some internationalization support.
Don't forget the poor guy whose name has a "â" in it.

This looks disturbingly familiar to a character encoding issue dealing with the Windows character set being stored in a database using the standard character encoding. I see someone voted Will down, but he has a point. You may be solving the immediate issue, but the combinations of characters are limitless if this is the issue.

If you really have to do this, regular expressions are probably the best solution.
I would strongly recommend that you think about why you have to do this, though - at least some of the characters you're listing as undesirable are perfectly valid and useful in other languages, and just filtering them out will most likely annoy at least some of your international users. As a Swede, I can't emphasize enough how much I hate systems that can't handle our å, ä and ö characters correctly.

Consider Regex.Replace(your_string, regex, "") - that's what I use.

Test each character in turn to see if it is a valid alphabetic or numeric character, and if not, remove it from the string. The character test is very simple; just use:
char.IsLetterOrDigit(c)
There are various others, such as:
char.IsSymbol(c)
char.IsControl(c)

Regex.Replace("The string", "[^a-zA-Z ]","");
That's how you'd do it in C#, although that regular expression ([^a-zA-Z ]) should work in most languages.
[Edited: forgot the space in the regex]

The ASCII / Integer code for these characters would be out of the normal alphabetic Ranges. Seek and replace with empty characters. String has a Replace method I believe.

Either use a blacklist of stuff you do not want, or preferably a white list (set). With a white list you iterate over the string and only copy the letters that are in your white list to the result string. You said remove, and the way you do that is having two pointers one you read from (R) and one you write to (W):
I DonÃ¢â‚
W R
If comma is in your whitelist then you would in this case read the comma and write it where Ã is, then advance both pointers. UTF-8 is a multi-byte encoding, so advancing a pointer may not just be adding one to the address.
With C, an easy way to get a white list is to use one of the predefined functions (or macros): isalnum, isalpha, isascii, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit. In this case you end up with a white list function instead of a set, of course.
Usually when I see data like you have I look for memory corruption, or evidence to suggest that the encoding I expect is different than the one the data was entered with.
/Allan

I had the same problem with extraneous junk thrown in by adobe in an EXIF dump. I spent an hour looking for a straight answer and trying numerous half-baked suggestions which did not work here.
This thread more than most I have read was replete with deep, probing questions like 'how did it get there?', 'what if somebody has this character in their name?', 'are you sure you want to break internationalization?'.
There were some impressive displays of erudition positing how this junk could have gotten here and explaining the evolution of the various character encoding schemes. The person wanted to know how to remove it, not how it came to be or what the standards orgs are up to, interesting as this trivia may be.
I wrote a tiny program which gave me the right answer. Instead of paraphrasing the main concept, here is the entire, self-contained, working (at least on my system) program and the output I used to nuke the junk:
#!/usr/local/bin/perl -w
# This runs in a dos window and shows the char, integer and hex values
# for the weird chars. Install the HEX values in the REGEXP below until
# the final test line looks normal.
$str = 's: â€œBrian'; # Nuke the 3 weird chars in front of Brian.
@str = split(//, $str);
printf("len str '$str' = %d, scalar \@str = %d\n",
    length $str, scalar @str);
$ii = -1;
foreach $c (@str) {
    $ii++;
    printf("$ii) char '$c', ord=%03d, hex='%s'\n",
        ord($c), unpack("H*", $c));
}
# Take the hex characters shown above, plug them into the below regexp
# until the junk disappears!
($s2 = $str) =~ s/[\xE2\x80\x9C]//g; # << Insert HEX values HERE
print("S2=>$s2<\n"); # Final test
Result:
M:\new\6s-2014.1031-nef.halloween>nuke_junk.pl
len str 's: GÇ£Brian' = 11, scalar @str = 11
0) char 's', ord=115, hex='73'
1) char ':', ord=058, hex='3a'
2) char ' ', ord=032, hex='20'
3) char 'G', ord=226, hex='e2'
4) char 'Ç', ord=128, hex='80'
5) char '£', ord=156, hex='9c'
6) char 'B', ord=066, hex='42'
7) char 'r', ord=114, hex='72'
8) char 'i', ord=105, hex='69'
9) char 'a', ord=097, hex='61'
10) char 'n', ord=110, hex='6e'
S2=>s: Brian<
It's NORMAL!!!
One other actionable, working suggestion I ran across:
iconv -c -t ASCII < 6s-2014.1031-238246.halloween.exf.dif > exf.ascii.dif

If the string contains junk data, this is a good way to remove it:
string InputString = "This is grate kingdom¢Ã‚¬â";
string replace = "Ã¢â‚¬â„¢";
string OutputString= Regex.Replace(InputString, replace, "");
//OutputString having the following result
It works well for me.
