I want to convert a string object to bytes and vice versa.
Is it possible using Ruby Packer/Unpacker?
I am unable to find the format specifier to use:
pack_object = "Test".pack('x'), where x is the format specifier
unpacked_object = pack_object.unpack('x'), which should result in the string "Test"
String has a bytes method that returns an array of integers:
'Type'.bytes
#=> [84, 121, 112, 101]
The equivalent unpack directive is C* (as already noted by cremno):
'Type'.unpack('C*')
#=> [84, 121, 112, 101]
Or the other way round:
[84, 121, 112, 101].pack('C*')
#=> "Type"
Note that pack returns a string in binary encoding.
Regarding your comment:
The output which I need is the same string which I packed
pack and unpack are counterparts, so you can use all kinds of directives:
'Type'.unpack('b*')
#=> ["00101010100111100000111010100110"]
['00101010100111100000111010100110'].pack('b*')
#=> "Type"
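As a minimal sketch of the full round trip asked about in the question: the variable names below are only illustrative, and the 'a*' directive is one possible choice for the "x" placeholder when you want to pack a whole string rather than an array of integers. Since pack returns a binary-encoded string, the sketch relabels the result with the original encoding.

```ruby
# Round trip: String -> bytes -> String.
original = "Test"

# 'C*' unpacks every byte as an unsigned 8-bit integer.
bytes = original.unpack('C*')

# pack('C*') rebuilds the string but tags it as binary (ASCII-8BIT),
# so the original encoding label is restored afterwards.
packed   = bytes.pack('C*')
restored = packed.dup.force_encoding(original.encoding)

# 'a*' treats the data as one arbitrary binary string, so a whole String
# can be round-tripped through a one-element Array.
via_a = [original].pack('a*').unpack('a*').first

puts bytes.inspect
puts restored
puts via_a
```

For ASCII-only data, the binary-tagged string still compares equal to the original; the encoding label only starts to matter once non-ASCII bytes are involved.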
tldr: I want to convert [125, 119, 48, 126, 40] to output string, }w0~(
To give a real life example, I am working with sequence data in fastq format (Here is a link to the library imported).
cat example.fastq outputs the following:
@some/random/identifier
ACTAG
+
}w0~(
The Julia code below demonstrates reading the fastq file:
import BioSequences.FASTQ
fastq_stream = FASTQ.Reader(open("example.fastq", "r"))
for record in fastq_stream
    # Still need to learn, why this offset of 33?
    println(
        Vector{Int8}(FASTQ.quality(record, :sanger)) .+ 33
    )
    println(
        String(FASTQ.sequence(record))
    )
    println(
        String(FASTQ.identifier(record))
    )
    break
end
close(fastq_stream)
This code prints the following:
[125, 119, 48, 126, 40]
ACTAG
some/random/identifier
I don't want to have to store this information in a list. I would prefer to convert it to string. So the output I am looking for here is:
}w0~(
ACTAG
some/random/identifier
julia> String(UInt8.([125, 119, 48, 126, 40]))
"}w0~("
Explanation
In Julia, a String is constructed from a sequence of bytes. If you are using ASCII only, the char-to-byte mapping is one-to-one and you can work directly on the raw data (which is also the fastest way to do it).
Note that since Julia Strings are immutable, creating a String from raw bytes makes the original byte vector unavailable; this also means that no data is copied in the String creation process. Have a look at the example below:
julia> mybytes = UInt8.([125, 119, 48, 126, 40]);
julia> mystring = String(mybytes)
"}w0~("
julia> mybytes
0-element Array{UInt8,1}
Performance note
Strings in Julia are not interned. In analytics scenarios, always consider using Symbols instead of Strings. In some scenarios, using temperature=:hot instead of temperature="hot" can mean 3x shorter execution time.
EDIT - performance test
julia> using Random, BenchmarkTools;Random.seed!(0);
bb = rand(33:126,1000);
julia> @btime join(Char.($bb));
31.573 μs (13 allocations: 6.56 KiB)
julia> @btime String(UInt8.($bb));
711.111 ns (2 allocations: 2.13 KiB)
String(UInt8.($bb)) is over 40x faster and uses 1/3 of the memory
I found a workable solution for now. I am sure there are more efficient solutions out there.
join(Char(i) for i in Vector{Int8}(FASTQ.quality(record, :sanger)) .+ 33) produces the output I require.
I need to remove non-UTF-8 characters from a string. Here is a screenshot of the text.
This is what it looks like when I open the string in NPP and set the encoding to UTF-8:
I think the ACK and FF are non UTF-8 characters.
I tried str.scrub as well as str.encode. Neither of them seems to work. scrub returns the same result, and encode results in an error.
We have a few problems.
The biggest is that a Ruby String stores arbitrary bytes along with a supposed encoding, with no guarantee that the bytes are valid in that encoding and with no obvious reason for that encoding to have been chosen. (I might be biased as a heavy user of Python 3. We would never speak of "changing a string from one encoding to another".)
Fortunately, the editor did not eat your post, but it's hard to see that. I'm guessing that you decoded the string as Windows-1252 in order to display it, which only obscures the issue.
Here's your string of bytes as I see it:
>> s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b
=> "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K"
>> s.bytes
=> [6, 45, 126, 36, 65, 167, 114, 117, 71, 249, 34, 154, 12, 182, 47, 75]
And it does contain bytes that are not valid UTF-8.
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> String::new(s).force_encoding(Encoding::UTF_8).valid_encoding?
=> false
We can ask to decode this as UTF-8 and insert � where we encounter bytes that are not valid UTF-8:
>> s.encode('utf-8', 'binary', :undef => :replace)
=> "\u0006-~$A�ruG�\"�\f�/K"
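If the goal is to drop the invalid bytes rather than replace them, one possible approach is sketched below. It assumes the same byte values as the example above; the variable names are illustrative. It also shows a likely reason why scrub appeared to do nothing: on a string tagged as binary, every byte is considered valid, so there is nothing to scrub until the string is labelled as UTF-8.

```ruby
# The bytes from the example above, rebuilt as a binary (ASCII-8BIT) string.
raw = [6, 45, 126, 36, 65, 167, 114, 117, 71, 249, 34, 154, 12, 182, 47, 75].pack('C*')

# scrub only has something to remove once the string is tagged as UTF-8;
# on a binary string every byte counts as valid.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
cleaned = as_utf8.scrub('')

# ACK (0x06) and FF (0x0C) are valid UTF-8 control characters,
# so dropping them is a separate, optional step.
printable = cleaned.gsub(/[[:cntrl:]]/, '')

puts cleaned.valid_encoding?
puts cleaned.bytes.inspect
puts printable
```

Note that ACK and FF survive the scrub step: they are ordinary ASCII control characters and therefore valid UTF-8. The bytes that actually get removed are 0xA7, 0xF9, 0x9A and 0xB6.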
What is the difference between the Ruby String methods codepoints and bytes?
'abcd'.bytes
=> [97, 98, 99, 100]
'abcd'.codepoints
=> [97, 98, 99, 100]
bytes returns the individual bytes, regardless of character size, whereas codepoints returns Unicode code points.
s = '日本語'
s.bytes # => [230, 151, 165, 230, 156, 172, 232, 170, 158]
s.codepoints # => [26085, 26412, 35486]
s.chars # => ["日", "本", "語"]
I see where your confusion comes from. Ruby now uses UTF-8 encoding by default, and UTF-8 was specifically designed so that its first code points (0-127) are encoded exactly as in ASCII. ASCII is an encoding with one-byte characters, so for the examples in your question, bytes and codepoints coincidentally return the same values.
So, if you need to break a string into characters, use either chars or codepoints (whichever is appropriate for your use case). Use bytes only when you treat the string as an opaque binary blob, not text.
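The ASCII overlap can be checked directly. The following is a small sketch (variable names are illustrative) that uses Array#pack with the 'U' directive to turn a code point into a UTF-8 string and then counts its bytes:

```ruby
# Code points 0..127 encode to exactly one byte in UTF-8...
single_byte = (0..127).all? { |cp| [cp].pack('U').bytesize == 1 }

# ...while anything above that needs two or more bytes.
e_acute = [0xE9].pack('U') # the character "é"

# For a string with one non-ASCII character, the two counts diverge.
word = "h\u00E9llo"

puts single_byte
puts e_acute.bytes.inspect
puts word.bytes.length
puts word.codepoints.length
```

So bytes and codepoints agree only as long as every character in the string is ASCII.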
Actually, chars (suggested above) might not be accurate enough, since Unicode has the notion of combining characters and modifier letters. If you care about this, you need to use so-called "grapheme clusters". Here's an example (taken from this answer):
s = "a\u0308\u0303\u0323\u032d"
s.bytes # => [97, 204, 136, 204, 131, 204, 163, 204, 173]
s.codepoints # => [97, 776, 771, 803, 813]
s.chars # => ["a", "̈", "̃", "̣", "̭"]
s.grapheme_clusters # => ["ạ̭̈̃"] # rendering of this glyph is kinda broken, which illustrates the point that unicode is hard
Do you know a better, faster, smarter, more efficient, or just more elegant way of doing the following?
Given this array
a = [171, 209, 3808, "723", "288", "6", "5", 27, "22", 207, 473, "256", 67, 1536]
get this
a.map{|i|i.to_i}.sort{|a,b|b<=>a}
=> [3808, 1536, 723, 473, 288, 256, 209, 207, 171, 67, 27, 22, 6, 5]
You can use in-place mutations to avoid creating new arrays:
a.map!(&:to_i).sort!.reverse!
Hard to know if it's faster or more efficient without a benchmark, though.
Here's one using symbol#to_proc
a.map(&:to_i).sort.reverse
This is faster than using in-place modifier (!) methods but uses more memory. As a bonus, it keeps the original array a intact if you want to do anything else with it.
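Since the speed claims above depend on the Ruby version and the input size, a rough way to check them yourself is the standard library's Benchmark module. The sketch below is only an illustration: the labels, the iteration count, and the candidates hash are arbitrary choices, and no particular timings are implied.

```ruby
require 'benchmark'

a = [171, 209, 3808, "723", "288", "6", "5", 27, "22", 207, 473, "256", 67, 1536]

# Each candidate receives its own copy of the array, so the in-place
# variant cannot affect the others.
candidates = {
  'map/sort with block' => ->(arr) { arr.map { |i| i.to_i }.sort { |x, y| y <=> x } },
  'map!/sort!/reverse!' => ->(arr) { arr.map!(&:to_i).sort!.reverse! },
  'map/sort/reverse'    => ->(arr) { arr.map(&:to_i).sort.reverse }
}

results = {}
timings = {}
candidates.each do |name, fn|
  results[name] = fn.call(a.dup)
  timings[name] = Benchmark.realtime { 10_000.times { fn.call(a.dup) } }
end

timings.each { |name, secs| puts format('%-22s %.4fs', name, secs) }
```

All three variants produce the same sorted array; only the timings and the amount of intermediate allocation differ.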
I was playing with Ruby sockets and ended up trying to put an IP packet together: I took an existing IP packet and tried to build a new one just like it.
Now my problem is this. The packet is 45 00 00 54 00 00 40 00 40 01 06 e0 7f 00 00 01 7f 00 00 01, which is obviously hexadecimal, so I converted it to decimal, then to binary data using the .pack method, and passed it to the send method. Wireshark then shows me something very different from what I created. I know I am doing something wrong, but I can't figure out what:
@packet = 0x4500005400004000400106e07f0000017f000001 # I converted each 32 bits together, not like I wrote
@data = ""
@data << @packet.to_s
@socket.send(@data.unpack(c*).to_s,@address)
Is there another way to solve the whole thing? Can I, for example, write the data I want to send directly to the socket buffer?
Thanks in advance.
Starting with a hex Bignum is a novel idea, though I can't immediately think of a good way to exploit it.
Anyway, trouble starts with the .to_s on the Bignum, which will have the effect of creating a string with the decimal representation of your number, taking you rather further from the bits and not closer. Somehow your c* seems to have lost its quotes, also.
But putting them back, you then unpack the string, which gets you an array of integers which are the ascii values of the digits in the decimal representation of the numeric value of the original hex string, and then you .to_s that (which IO would have done anyway, so, no blame there at least) but this then results in a string with the printable representation of the ascii numbers of the unpacked string, so you are now light-years from the original intention.
>> t = 0x4500005400004000400106e07f0000017f000001
=> 393920391770565046624940774228241397739864195073
>> t.to_s
=> "393920391770565046624940774228241397739864195073"
>> t.to_s.unpack('c*')
=> [51, 57, 51, 57, 50, 48, 51, 57, 49, 55, 55, 48, 53, 54, 53, 48, 52, 54, 54, 50, 52, 57, 52, 48, 55, 55, 52, 50, 50, 56, 50, 52, 49, 51, 57, 55, 55, 51, 57, 56, 54, 52, 49, 57, 53, 48, 55, 51]
>> t.to_s.unpack('c*').to_s
=> "515751575048515749555548535453485254545052575248555552505056505249515755555157565452495753485551"
It's kind of interesting in a way. All the information is still there, sort of.
Anyway, you need to make a binary string. Either just << numbers into it:
>> s = ''; s << 1 << 2
=> "\001\002"
Or use Array#pack:
>> [1,2].pack 'c*'
=> "\001\002"
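For the 20 bytes from the question specifically, a possible shortcut is the 'H*' directive, which converts a string of hex digits straight into raw bytes. The sketch below also decodes the result again using the standard IPv4 header layout, as a sanity check; the variable names are illustrative.

```ruby
# The header bytes from the question, written as one hex string.
hex = '4500005400004000400106e07f0000017f000001'

# 'H*' reads hex digits, high nibble first, and emits the raw bytes.
packet = [hex].pack('H*')

# Decode the fixed 20-byte IPv4 header: C = 8-bit, n = 16-bit and
# N = 32-bit unsigned, the latter two in network (big-endian) byte order.
ver_ihl, _tos, total_len, _id, _frag, ttl, proto, _checksum, src, dst =
  packet.unpack('CCnnnCCnNN')

puts packet.bytesize
puts packet.unpack('H*').first
puts format('version/IHL=0x%02x length=%d ttl=%d protocol=%d', ver_ihl, total_len, ttl, proto)
puts format('src=0x%08x dst=0x%08x', src, dst)
```

This is the binary string you would hand to send, instead of the decimal text produced by .to_s.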
First check your host byte order, because what you see in Wireshark is in network byte order (big-endian). Then, in Wireshark you will see the protocol headers (which depend on whether it is a TCP socket or a UDP one) followed by the data. You cannot send IP packets directly through such a socket. So you will see this data in the packet's data section (i.e. the data section of the TCP/UDP packet).
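To see what the byte-order difference looks like in practice, here is a small sketch using pack directives. The value 0x0054 is the total-length field from the question's packet; everything else is illustrative.

```ruby
value = 0x0054

big_endian    = [value].pack('n') # 16-bit, network (big-endian) byte order
little_endian = [value].pack('v') # 16-bit, little-endian byte order
native        = [value].pack('S') # 16-bit, whatever order the host CPU uses

host_order = native == big_endian ? 'big-endian' : 'little-endian'

puts "big-endian bytes:    #{big_endian.bytes.inspect}"
puts "little-endian bytes: #{little_endian.bytes.inspect}"
puts "this host is #{host_order}"
```

Using 'n' and 'N' when building headers keeps the output in network byte order regardless of the host.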