Impala substr can't get utf8 character correctly - hadoop

I am new to ETL and was assigned the task of sanitizing some sensitive information before the data is handed to a client.
I am using HUE web client with Impala.
What I want to do is this: given a column value like '京客隆(三里屯店)', I need to transform it into something like '京XXX店)'.
My query is:
select '京客隆(三里屯店)', concat(substr('京客隆(三里屯店)', 1, 3), 'XXX', substr('京客隆(三里屯店)', char_length('京客隆(三里屯店)') -6, 6));
But I get gibberish in the output:
'京客隆(三里屯店)' | concat(substr('京客隆(三里屯店)', 1, 3), 'xxx', substr('京客隆(三里屯店)', char_length('京客隆(三里屯店)') - 6, 6))
京客隆(三里屯店) | 京XXX�店�
The problem is that :
select '京客隆(三里屯店)', substr('京客隆(三里屯店)', char_length('京客隆(三里屯店)') -3 , 3);
output: 京客隆(三里屯店) ��
does not return the correct characters. Why is that? When I paste the string into a Python shell and take only the last 3 bytes, I get the correct character.

It turns out that I misunderstood the substr function.
substr(STRING a, INT start [, INT len]):
It counts positions from (and including) INT start, and here those positions are bytes. My string '京客隆(三里屯店)' is 27 bytes long in total, and each UTF-8 character in it takes 3 bytes. To take the last 3 bytes, which form the ), I need to write:
substr('京客隆(三里屯店)', 27 - 2, 3).
This returns bytes 25, 26, and 27, and the character ) is displayed correctly.
Update:
I was told to use:
SELECT regexp_replace('京客隆(三里屯店)', '(.)(.*)(.{2})', '\\1***\\3');
It works like a charm :P.
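The byte arithmetic above can be checked outside Impala. The following is a minimal Python sketch, not Impala code: substr_bytes is a hypothetical helper that mimics a byte-oriented, 1-based substr, and the full-width parentheses (U+FF08, U+FF09) are an assumption made so that the string is 27 bytes long as described.

```python
# Assumption: the parentheses in the data are full-width (U+FF08 / U+FF09),
# so every character in the string occupies 3 bytes in UTF-8.
s = "京客隆\uff08三里屯店\uff09"
raw = s.encode("utf-8")


def substr_bytes(data: bytes, start: int, length: int) -> bytes:
    """Hypothetical helper: byte-oriented substring with a 1-based start."""
    return data[start - 1 : start - 1 + length]


print(len(s), len(raw))  # character count and byte count

# Bytes 25, 26 and 27 form the complete closing parenthesis.
last = substr_bytes(raw, len(raw) - 2, 3)
print(last.decode("utf-8"))

# Starting one byte earlier (bytes 24-26) cuts two characters in half.
broken = substr_bytes(raw, len(raw) - 3, 3)
print(broken.decode("utf-8", errors="replace"))

# Character-aware masking: keep the first character and the last two,
# which is what the regexp_replace pattern above expresses.
masked = s[0] + "***" + s[-2:]
print(masked)
```

Working on characters rather than bytes, as the regular expression does, avoids having to know how many bytes each character occupies.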


Dynamic number system in Qlik Sense

My data consists of large numbers. I have a column, say 'amount', and when I use it in charts (sum of amount on the Y axis) it shows something like 1.4G. I want billions shown as, e.g., 2.8B, millions as 80M, and thousands (14,000) simply as 14k.
I have used if(sum(amount)/1000000000 > 1, Num(sum(amount)/1000000000, '#,###B'), Num(sum(amount)/1000000, '#,###M')), but it does not show the M or B at the end of the figure. How can I also include thousands in the same expression?
EDIT: Updated to include the dual() function.
This worked for me:
=dual(
if(sum(amount) < 1, Num(sum(amount), '#,##0.00'),
if(sum(amount) < 1000, Num(sum(amount), '#,##0'),
if(sum(amount) < 1000000, Num(sum(amount)/1000, '#,##0k'),
if(sum(amount) < 1000000000, Num(sum(amount)/1000000, '#,##0M'),
Num(sum(amount)/1000000000, '#,##0B')
))))
, sum(amount)
)
Here are some example outputs using this expression to format the values:
=sum(amount) | Formatted
2,526,163,764 | 3B
79,342,364 | 79M
5,589,255 | 5M
947,470 | 947k
583 | 583
0.6434 | 0.64
To show decimals for any of those, like 2.53B instead of 3B, add decimal places to the format pattern, e.g. '#,##0.00B'.
Also make sure that the Number Formatting property is set to Auto or Measure expression.
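The threshold logic of the expression can be sketched outside Qlik as well. The following Python function is only an illustration of the same cascade of comparisons; format_amount is a hypothetical name, it produces just the text part (it has no counterpart to dual(), which also keeps the numeric value), and Python's rounding may not match Num() in every case.

```python
def format_amount(value: float) -> str:
    """Format a number with k/M/B suffixes, using the thresholds from the expression."""
    if value < 1:
        return f"{value:,.2f}"
    if value < 1_000:
        return f"{value:,.0f}"
    if value < 1_000_000:
        return f"{value / 1_000:,.0f}k"
    if value < 1_000_000_000:
        return f"{value / 1_000_000:,.0f}M"
    return f"{value / 1_000_000_000:,.0f}B"


for amount in (2_526_163_764, 79_342_364, 947_470, 583, 0.6434):
    print(amount, "->", format_amount(amount))
```

In this sketch a value just under a threshold, such as 999,999, is rounded after the branch is chosen and so comes out as 1,000k rather than 1M.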

strconv.ParseInt fails if number starts with 0

I'm currently having issues parsing some numbers starting with 0 in Go.
fmt.Println(strconv.ParseInt("0491031", 0, 64))
0 strconv.ParseInt: parsing "0491031": invalid syntax
GoPlayground: https://go.dev/play/p/TAv7IEoyI8I
I think this is due to some base conversion error, but I don't know how to fix it.
I'm getting this error while parsing a 5GB+ CSV file with gocsv, in case more details are needed.
[This error was caused by the GoCSV library, which doesn't let you specify a base for the numbers you're going to parse.]
Quoting from strconv.ParseInt()
If the base argument is 0, the true base is implied by the string's prefix following the sign (if present): 2 for "0b", 8 for "0" or "0o", 16 for "0x", and 10 otherwise. Also, for argument base 0 only, underscore characters are permitted as defined by the Go syntax for integer literals.
You are passing 0 for base, so the base is inferred from the string value. Since the string starts with '0' and the next character is not 'b', 'o', or 'x', your number is interpreted as octal (base 8), and the digit 9 is invalid there.
Note that this would work:
fmt.Println(strconv.ParseInt("0431031", 0, 64))
And the output is:
143897 <nil>
(Octal 431031 equals 143897 decimal.)
If your input is in base 10, pass 10 for base:
fmt.Println(strconv.ParseInt("0491031", 10, 64))
Then the output will be:
491031 <nil>
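The prefix rule quoted above can also be written out step by step. The sketch below is in Python rather than Go and is a simplified model, not the strconv implementation: infer_base and parse_with_inferred_base are hypothetical names, and details such as underscore handling are left out.

```python
def infer_base(text: str) -> int:
    """Return the base implied by the prefix, following the quoted rule."""
    digits = text.lstrip("+-").lower()
    if digits.startswith("0b"):
        return 2
    if digits.startswith("0x"):
        return 16
    if digits.startswith("0"):  # covers "0o..." and a bare leading "0"
        return 8
    return 10


def parse_with_inferred_base(text: str) -> int:
    """Parse text in the inferred base; raises ValueError on an invalid digit."""
    return int(text, infer_base(text))


for text in ("0491031", "0431031"):
    try:
        print(text, "->", parse_with_inferred_base(text))
    except ValueError as err:
        print(text, "-> error:", err)

# Forcing base 10 accepts the leading zero.
print(int("0491031", 10))
```

The first input fails because 9 is not an octal digit, the second parses as octal, and the explicit base 10 call returns the decimal value.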

How to split a string by amount of characters in a batch file?

I have about 6GB of various text files. The files have many lines, but each record is missing its commas, so all the fields in a record run together. I want to create a batch file that adds commas at the appropriate places in each "record", so that I can then import the data into a database.
For example the file would be structured like this.
IDnameADDRESSphoneEMAILetc
IDnameADDRESSphoneEMAILetc
IDnameADDRESSphoneEMAILetc
Each field has a unique length which I know, and it's static between all files.
For example
ID - 10 characters
NAME - 40 characters
ADDRESS - 30 characters
etc
This will need to be run on an ongoing basis as new files come in, so I'm hoping for something I can give to a non-technical person to just run.
Any quick way to do this in a bat file?
Using your example above: note that we count characters starting from 0, then tell set where each substring starts and how many characters to take from there. See the bottom for the layout.
@echo off
setlocal enabledelayedexpansion
for /F "tokens=* delims=" %%a in (filename.txt) do (
set str=%%a
set id=!str:~0,2!
set na=!str:~2,4!
set add=!str:~6,7!
set ph=!str:~13,5!
set em=!str:~18,5!
set etc=!str:~23,3!
echo !id!,!na!,!add!,!ph!,!em!,!etc!
)
Characters are indexed in the string as:
I  D  n  a  m  e  A  D  D  R  E  S  S  p  h  o  n  e  E  M  A  I  L  e  t  c
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
ID starts at character 0 and is 2 characters long, :~0,2
name starts at character 2 and is 4 characters long, :~2,4
etc.
For many files just add another loop as a main loop or give a list of files.
Based on your provided example, here is a quick PowerShell command (despite there being no PowerShell tag):
(GC 'Report.txt' | Select -First 1).Insert(10,',').Insert(51,',').Insert(82,',') > 'Fixed.txt'
It takes the first line of Report.txt…
inserts a comma after the first 10 characters, at index 10 (0 + 10 = 10; the next field starts at 10 + 1 = 11),
inserts a comma after another 40 characters, at index 51 (11 + 40 = 51; the next field starts at 52),
inserts a comma after another 30 characters, at index 82 (52 + 30 = 82; the next field starts at 83),
etc.
…then writes the line, complete with insertions, to Fixed.txt.
Just continue the .Insert(<number>,',') sequence for your other fixed-width column sizes, and make sure you change the filenames to suit your circumstances.
Edit
The following, updated in response to your comment and subsequent edit, should work for all lines in the file.
GC 'Report.txt' | % {($_).Insert(10,',').Insert(51,',').Insert(82,',')} | Out-File 'Fixed.txt'
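The same fixed-width split can be sketched in Python for comparison. This is an illustration under stated assumptions: split_fixed and to_csv are hypothetical helper names, WIDTHS holds only the three field lengths given in the question, and the sample record is made up. The csv module is used so that a field that already contains a comma is quoted instead of breaking the row.

```python
import csv
import io

# Assumed widths for the first three fields (ID, NAME, ADDRESS) from the question.
WIDTHS = [10, 40, 30]


def split_fixed(line: str, widths: list[int]) -> list[str]:
    """Cut a fixed-width record into fields of the given widths."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width])
        start += width
    if start < len(line):
        fields.append(line[start:])  # whatever follows the listed fields
    return fields


def to_csv(lines, widths) -> str:
    """Return the records as CSV text, one row per input line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        writer.writerow(split_fixed(line.rstrip("\r\n"), widths))
    return out.getvalue()


# Made-up sample record, padded to the assumed widths.
sample = "1".ljust(10) + "Jane Doe".ljust(40) + "1 Main St, Apt 2".ljust(30)
print(to_csv([sample], WIDTHS))
```

For real files, the lines argument could be an open file object, so that records are processed one at a time rather than loaded into memory at once.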

Wireshark: read 8 bytes of timestamp

I'm new to writing dissectors in C and I need to read an 8-byte timestamp from a packet.
I'm trying the following code:
g_print("offset=%d, starttime=0x%08x\n", offset, tvb_get_letoh64(tvb, offset));
and I get:
offset=8, starttime=0x0362ea14
which is only 4 bytes out of the 8 I was expecting.
How can I read it so the output would be:
offset=8, starttime=0x14ea620305779840
I also tried reading it using:
g_print("offset=%d, starttime=0x%08x\n", offset, tvb_get_bits64(tvb, 64, 32, ENC_LITTLE_ENDIAN));
g_print("offset=%d, starttime=0x%08x\n", offset, tvb_get_bits64(tvb, 64, 64, ENC_LITTLE_ENDIAN));
The first call printed the first 4 bytes of the timestamp and the second call printed the last 4 bytes. I'm missing something very basic...
Second question: assuming I get the value right and convert it into an nstime_t, how can I format it as a date/time, something like:
YYYY-MM-DDZHH:MM:SS:MMMM
Thank you so much!
What output do you get with this?
g_print("offset=%d, starttime=0x%08lx\n", offset, tvb_get_letoh64(tvb, offset));
As for your 2nd question, what is the meaning of these 8 bytes? Maybe you can declare your hf variable using FT_ABSOLUTE_TIME and use something like proto_tree_add_time(), proto_tree_add_time_item(), proto_tree_add_time_format_value() or proto_tree_add_time_format()?
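The numbers in the question are consistent with a 64-bit value being printed through a 32-bit format specifier. The Python sketch below only illustrates that arithmetic; it assumes the 8 bytes on the wire are 14 ea 62 03 05 77 98 40, which is inferred from the desired output and not confirmed by the question.

```python
# Assumption: these are the 8 timestamp bytes in packet order.
raw = bytes.fromhex("14ea620305779840")

# Little-endian interpretation, as tvb_get_letoh64 would read them.
value_le = int.from_bytes(raw, "little")
print(f"0x{value_le:016x}")  # all 64 bits

# Keeping only the low 32 bits reproduces the value printed in the question.
low32 = value_le & 0xFFFFFFFF
print(f"0x{low32:08x}")

# The output the question asks for is the big-endian reading of the same bytes.
value_be = int.from_bytes(raw, "big")
print(f"0x{value_be:016x}")
```

Under that assumption, printing the full value needs a 64-bit format specifier, and the desired digit order corresponds to a big-endian read rather than a little-endian one.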

What could be Regex for reading first line of file with first few bytes and then rest of file content except last 8 bytes of last line of file?

I am processing a binary file from which I want to retrieve the first 4 bytes, the next 4 bytes, another 4 bytes, and then the rest of the file contents except the last 8 bytes of the last line.
I tried file.read.scan(/(.{4})(.{4})(.{4})(.*\w)(.{8})/).each do |a,b,c,d,e|, but after some iterations the regex starts matching the 4-byte, 4-byte, 4-byte pattern again from a line in the middle of the file. Because of this, my condition check fails.
I want to do the following:
Read the first 4 bytes of the first line of the file, then bytes 5 to 7, then bytes 8 to 11, then the rest of the file content except the last 8 bytes of the last line.
What would the regex for this be in Ruby?
Use #read instead of a regexp:
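f = File.open(file_name, "rb")             # binary mode
chunk1 = f.read(4)                         # bytes 1 to 4
chunk2 = f.read(3)                         # bytes 5 to 7
chunk3 = f.read(4)                         # bytes 8 to 11
chunk4 = f.read(f.size - (4 + 3 + 4 + 8))  # the rest, minus the last 8 bytes
f.close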
How about:
/(.{4})(.{3})(.{4})(.*).{8}/m
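The fixed-offset approach can also be expressed as plain slicing over the bytes. The following Python sketch is an illustration only: split_chunks is a hypothetical name, the sizes 4, 3, 4 and 8 are taken from the answers above, and the sample data is made up.

```python
def split_chunks(data: bytes):
    """Split data into three header chunks, a body, and an 8-byte trailer."""
    head_sizes = (4, 3, 4)
    trailer_size = 8
    if len(data) < sum(head_sizes) + trailer_size:
        raise ValueError("input is shorter than the header plus the trailer")
    chunks = []
    pos = 0
    for size in head_sizes:
        chunks.append(data[pos:pos + size])
        pos += size
    body = data[pos:len(data) - trailer_size]
    trailer = data[len(data) - trailer_size:]
    return chunks, body, trailer


# Made-up sample: 30 bytes with values 0..29, including a newline byte (10).
data = bytes(range(30))
chunks, body, trailer = split_chunks(data)
print([len(c) for c in chunks], len(body), len(trailer))
```

Slicing by offset does not depend on where newline bytes happen to fall in the data, which is the situation that made the original pattern restart in the middle of the file.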
