Get more informative data set produced from Flume+Kafka - hadoop

I have configured my Flume job with Kafka as the source. The data landing in my folder is not human-readable. Does something need to be changed in my Flume configuration?
Flume job config:
#source
MY_AGENT.sources.my-source.type = org.apache.flume.source.kafka.KafkaSource
MY_AGENT.sources.my-source.channels = my-channel
MY_AGENT.sources.my-source.batchSize = 10000
MY_AGENT.sources.my-source.useFlumeEventFormat = false
MY_AGENT.sources.my-source.batchDurationMillis = 5000
MY_AGENT.sources.my-source.kafka.bootstrap.servers =${BOOTSTRAP_SERVERS}
MY_AGENT.sources.my-source.kafka.topics = my-topic
MY_AGENT.sources.my-source.kafka.consumer.group.id = my-topic_grp
MY_AGENT.sources.my-source.kafka.consumer.client.id = my-topic_clnt
MY_AGENT.sources.my-source.kafka.compressed.topics = my-topic
MY_AGENT.sources.my-source.kafka.auto.commit.enable = false
MY_AGENT.sources.my-source.kafka.consumer.session.timeout.ms=100000
MY_AGENT.sources.my-source.kafka.consumer.request.timeout.ms=120000
MY_AGENT.sources.my-source.kafka.consumer.max.partition.fetch.bytes=704857
MY_AGENT.sources.my-source.kafka.consumer.auto.offset.reset=latest
#channel
MY_AGENT.channels.my-channel.type = memory
MY_AGENT.channels.my-channel.capacity = 100000000
MY_AGENT.channels.my-channel.transactionCapacity = 100000
MY_AGENT.channels.my-channel.parseAsFlumeEvent = false
#Sink
MY_AGENT.sinks.my-sink.channel = my-channel
MY_AGENT.sinks.my-sink.type = hdfs
MY_AGENT.sinks.my-sink.hdfs.writeFormat= Text
MY_AGENT.sinks.my-sink.hdfs.fileType = DataStream
MY_AGENT.sinks.my-sink.hdfs.kerberosPrincipal =${user}
MY_AGENT.sinks.my-sink.hdfs.kerberosKeytab =${keytab}
MY_AGENT.sinks.my-sink.hdfs.useLocalTimeStamp = true
MY_AGENT.sinks.my-sink.hdfs.path = hdfs://nameservice1/my_hdfs/my_table1/timestamp=%Y%m%d
MY_AGENT.sinks.my-sink.hdfs.rollCount=0
MY_AGENT.sinks.my-sink.hdfs.rollSize=0
MY_AGENT.sinks.my-sink.hdfs.batchSize=100000
MY_AGENT.sinks.my-sink.hdfs.maxOpenFiles=2000
MY_AGENT.sinks.my-sink.hdfs.callTimeout=50000
MY_AGENT.sinks.my-sink.hdfs.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
MY_AGENT.sinks.my-sink.hdfs.schema.registry.url = ${SCHEMA_URL}
Output data (hexdump):
0000000: 53 45 51 06 21 6f 72 67 2e 61 70 61 63 68 65 2e SEQ.!org.apache.
0000010: 68 61 64 6f 6f 70 2e 69 6f 2e 4c 6f 6e 67 57 72 hadoop.io.LongWr
0000020: 69 74 61 62 6c 65 22 6f 72 67 2e 61 70 61 63 68 itable"org.apach
0000030: 65 2e 68 61 64 6f 6f 70 2e 69 6f 2e 42 79 74 65 e.hadoop.io.Byte
0000040: 73 57 72 69 74 61 62 6c 65 00 00 00 00 00 00 85 sWritable.......
0000050: a6 6f 46 0c f4 16 33 a6 eb 43 c2 21 5c 1b 4f 00 .oF...3..C.!\.O.
0000060: 00 00 18 00 00 00 08 00 00 01 4d c6 1b 01 1f 00 ..........M.....
0000070: 00 00 0c 48 65 6c 6c 6f 20 48 44 46 53 21 0d ...Hello HDFS!.
This is the kind of output I'm getting. I was expecting JSON-like results; does something need to be changed in my Flume config file?
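The "SEQ" bytes at the start of the dump indicate the file is a Hadoop SequenceFile (LongWritable keys, BytesWritable values), not plain text or JSON. As a rough illustration (not part of the original question), the header can be parsed directly; this sketch assumes the class-name lengths fit in a single vint byte, which holds for the usual Writable classes:

```python
def parse_seqfile_header(data: bytes):
    """Parse the start of a Hadoop SequenceFile header.

    Returns (version, key_class, value_class); assumes both class-name
    lengths fit in one vint byte.
    """
    if data[:3] != b"SEQ":
        raise ValueError("not a SequenceFile")
    version = data[3]
    pos = 4
    names = []
    for _ in range(2):
        length = data[pos]  # single-byte vint length
        pos += 1
        names.append(data[pos:pos + length].decode("ascii"))
        pos += length
    return version, names[0], names[1]

# First bytes copied from the hexdump in the question
header = bytes.fromhex(
    "534551 06 21"
    " 6f72672e6170616368652e6861646f6f702e696f2e4c6f6e675772697461626c65"
    " 22"
    " 6f72672e6170616368652e6861646f6f702e696f2e42797465735772697461626c65"
)
print(parse_seqfile_header(header))
```

Equivalently, running hdfs dfs -text on the file decodes the SequenceFile wrapper, which is often the quickest way to see the actual payload.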

Related

Creating Gmail labels with Japanese characters

I've got some code to create labels in Gmail, which usually works fine. But now the requirement is to create a label with Japanese characters, specifically "アーカイブ". I am encoding the JSON like this:
7B 0D 0A 22 6E 61 6D 65 22 3A 22 E3 82 A2 E3 83 {.."name":".....
BC E3 82 AB E3 82 A4 E3 83 96 22 2C 0D 0A 22 6D ..........",.."m
65 73 73 61 67 65 4C 69 73 74 56 69 73 69 62 69 essageListVisibi
6C 69 74 79 22 3A 22 73 68 6F 77 22 2C 0D 0A 22 lity":"show",.."
6C 61 62 65 6C 4C 69 73 74 56 69 73 69 62 69 6C labelListVisibil
69 74 79 22 3A 22 6C 61 62 65 6C 53 68 6F 77 22 ity":"labelShow"
0D 0A 7D 0D 0A 00 00 00 00 00 00 00 00 00 00 00 ..}.............
As you can see, the first character is the UTF8 sequence E3 82 A2, which if you look at this table (https://www.utf8-chartable.de/unicode-utf8-table.pl?start=12352&names=-) seems to be correct for that first character. The others look OK also.
As a test, I created a Japanese folder with that name in the UI, then got a dump of the json that Gmail produces when I get a list of existing folders. What Gmail produces is exactly the same as what I'm trying to import. So I don't see what I could be doing wrong here. Any help appreciated.
Never mind this - turns out my Japanese characters translate to "Archive" which is apparently a reserved folder name.
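As a quick sanity check (illustrative snippet, not from the original post), the byte sequence from the dump decodes straight to the label in question:

```python
# UTF-8 bytes for the label name, copied from the hexdump above
label_bytes = bytes.fromhex("e382a2e383bce382abe382a4e38396")
label = label_bytes.decode("utf-8")
print(label)  # アーカイブ, which translates to "Archive"
```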

How to read a hexadecimal file and convert the content to byte slice in golang?

The hexadecimal file is in Charles Proxy's hex format, and it may contain invisible characters. Example content:
00000000 7b 22 73 75 70 70 6f 72 74 73 5f 69 6d 70 6c 69 {"supports_impli
00000010 63 69 74 5f 73 64 6b 5f 6c 6f 67 67 69 6e 67 22 cit_sdk_logging"
00000020 3a 74 72 75 65 2c 22 67 64 70 76 34 5f 6e 75 78 :true,"gdpv4_nux
00000030 5f 65 6e 61 62 6c 65 64 22 3a 66 61 6c 73 65 2c _enabled":false,
00000040 22 61 6e 64 72 6f 69 64 5f 73 64 6b 5f 65 72 72 "android_sdk_err
00000050 6f 72 5f 63 61 74 65 67 6f 72 69 65 73 22 3a 5b or_categories":[
00000060 7b 22 6e 61 6d 65 22 3a 22 6c 6f 67 69 6e 5f 72 {"name":"login_r
00000070 65 63 6f 76 65 72 61 62 6c 65 22 2c 22 69 74 65 ecoverable","ite
00000080 6d 73 22 3a 5b 7b 22 63 6f 64 65 22 3a 31 30 32 ms":[{"code":102
00000090 7d 2c 7b 22 63 6f 64 65 22 3a 31 39 30 7d 5d 2c },{"code":190}],
000000a0 22 72 65 63 6f 76 65 72 79 5f 6d 65 73 73 61 67 "recovery_messag
000000b0 65 22 3a 22 5c 75 38 62 66 37 5c 75 39 31 63 64 e":"\u8bf7\u91cd
000000c0 5c 75 36 35 62 30 5c 75 37 36 37 62 5c 75 35 66 \u65b0\u767b\u5f
000000d0 35 35 5c 75 35 65 39 34 5c 75 37 35 32 38 5c 75 55\u5e94\u7528\u
000000e0 37 61 30 62 5c 75 35 65 38 66 5c 75 66 66 30 63 7a0b\u5e8f\uff0c
000000f0 5c 75 35 31 38 64 5c 75 36 62 32 31 5c 75 38 66 \u518d\u6b21\u8f
00000100 64 65 5c 75 36 33 61 35 20 46 61 63 65 62 6f 6f de\u63a5 Faceboo
00000110 6b 20 5c 75 35 65 31 30 5c 75 36 32 33 37 5c 75 k \u5e10\u6237\u
00000120 33 30 30 32 22 7d 5d 2c 22 61 70 70 5f 65 76 65 3002"}],"app_eve
00000130 6e 74 73 5f 73 65 73 73 69 6f 6e 5f 74 69 6d 65 nts_session_time
00000140 6f 75 74 22 3a 36 30 2c 22 61 70 70 5f 65 76 65 out":60,"app_eve
00000150 6e 74 73 5f 66 65 61 74 75 72 65 5f 62 69 74 6d nts_feature_bitm
00000160 61 73 6b 22 3a 36 35 35 35 39 2c 22 73 65 61 6d ask":65559,"seam
00000170 6c 65 73 73 5f 6c 6f 67 69 6e 22 3a 31 2c 22 73 less_login":1,"s
00000180 6d 61 72 74 5f 6c 6f 67 69 6e 5f 62 6f 6f 6b 6d mart_login_bookm
00000190 61 72 6b 5f 69 63 6f 6e 5f 75 72 6c 22 3a 22 68 ark_icon_url":"h
000001a0 74 74 70 73 3a 5c 2f 5c 2f 73 74 61 74 69 63 2e ttps:\/\/static.
000001b0 78 78 2e 66 62 63 64 6e 2e 6e 65 74 5c 2f 72 73 xx.fbcdn.net\/rs
000001c0 72 63 2e 70 68 70 5c 2f 76 33 5c 2f 79 73 5c 2f rc.php\/v3\/ys\/
000001d0 72 5c 2f 43 36 5a 75 74 59 44 53 61 61 56 2e 70 r\/C6ZutYDSaaV.p
000001e0 6e 67 22 2c 22 73 6d 61 72 74 5f 6c 6f 67 69 6e ng","smart_login
000001f0 5f 6d 65 6e 75 5f 69 63 6f 6e 5f 75 72 6c 22 3a _menu_icon_url":
00000200 22 68 74 74 70 73 3a 5c 2f 5c 2f 73 74 61 74 69 "https:\/\/stati
00000210 63 2e 78 78 2e 66 62 63 64 6e 2e 6e 65 74 5c 2f c.xx.fbcdn.net\/
00000220 72 73 72 63 2e 70 68 70 5c 2f 76 33 5c 2f 79 73 rsrc.php\/v3\/ys
00000230 5c 2f 72 5c 2f 30 69 61 72 70 6e 77 64 6d 45 78 \/r\/0iarpnwdmEx
00000240 2e 70 6e 67 22 2c 22 72 65 73 74 72 69 63 74 69 .png","restricti
00000250 76 65 5f 64 61 74 61 5f 66 69 6c 74 65 72 5f 70 ve_data_filter_p
00000260 61 72 61 6d 73 22 3a 22 7b 7d 22 2c 22 61 61 6d arams":"{}","aam
00000270 5f 72 75 6c 65 73 22 3a 22 7b 7d 22 2c 22 73 75 _rules":"{}","su
00000280 67 67 65 73 74 65 64 5f 65 76 65 6e 74 73 5f 73 ggested_events_s
00000290 65 74 74 69 6e 67 22 3a 22 7b 5c 22 70 72 6f 64 etting":"{\"prod
000002a0 75 63 74 69 6f 6e 5f 65 76 65 6e 74 73 5c 22 3a uction_events\":
000002b0 5b 5d 2c 5c 22 65 6c 69 67 69 62 6c 65 5f 66 6f [],\"eligible_fo
000002c0 72 5f 70 72 65 64 69 63 74 69 6f 6e 5f 65 76 65 r_prediction_eve
000002d0 6e 74 73 5c 22 3a 5b 5c 22 66 62 5f 6d 6f 62 69 nts\":[\"fb_mobi
000002e0 6c 65 5f 61 64 64 5f 74 6f 5f 63 61 72 74 5c 22 le_add_to_cart\"
000002f0 2c 5c 22 66 62 5f 6d 6f 62 69 6c 65 5f 70 75 72 ,\"fb_mobile_pur
00000300 63 68 61 73 65 5c 22 2c 5c 22 66 62 5f 6d 6f 62 chase\",\"fb_mob
00000310 69 6c 65 5f 63 6f 6d 70 6c 65 74 65 5f 72 65 67 ile_complete_reg
00000320 69 73 74 72 61 74 69 6f 6e 5c 22 2c 5c 22 66 62 istration\",\"fb
00000330 5f 6d 6f 62 69 6c 65 5f 69 6e 69 74 69 61 74 65 _mobile_initiate
00000340 64 5f 63 68 65 63 6b 6f 75 74 5c 22 5d 7d 22 2c d_checkout\"]}",
00000350 22 69 64 22 3a 22 31 36 33 35 34 33 35 31 34 39 "id":"1635435149
00000360 30 39 30 34 35 22 7d 09045"}
How can I read the file and convert the hexadecimal content to a Go []byte slice? Is there a convenient method, and how should I split the lines in the file?
I solved it by relying on the fixed layout of the hexadecimal content: the offset column on each line is 8 characters wide, so the hexadecimal content begins at offset 8 and ends at most at column 57. The code is as follows:
import (
    "encoding/hex"
    "strings"
)

const (
    beginOffset = 8  // width of the offset column; the hex content starts here
    endLength   = 57 // column at which the hex content ends
)

// ReadHexData extracts the hex columns from a Charles-style dump and
// decodes them into raw bytes. The hex region is assumed to be
// space-padded out to endLength on full lines.
func ReadHexData(src []byte) ([]byte, error) {
    lines := strings.Split(string(src), "\n")
    var sb strings.Builder
    for _, line := range lines {
        end := endLength
        if len(line) < end { // guard empty or short lines (e.g. the last one)
            end = len(line)
        }
        for i := beginOffset; i < end; i++ {
            if line[i] != ' ' {
                sb.WriteByte(line[i])
            }
        }
    }
    return hex.DecodeString(sb.String())
}

Chunk size appears on Browser page

I'm implementing a small web server on a wifi micro. To aid in development and testing, I have ported it to a Windows console program.
I use chunked transfer processing. The following is what shows up on the browser:
0059
Hello World
0
The 0059 is the size of the chunk in hex, and the 0 is the terminating chunk size.
This is the data captured via wireshark:
This is the first message I send, which contains the headers:
0000 48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d HTTP/1.1 200 OK.
0010 0a 53 65 72 76 65 72 3a 20 54 72 61 6e 73 66 65 .Server: Transfe
0020 72 2d 45 6e 63 6f 64 69 6e 67 3a 20 63 68 75 6e r-Encoding: chun
0030 6b 65 64 0d 0a 43 6f 6e 74 65 6e 74 2d 54 79 70 ked..Content-Typ
0040 65 3a 20 74 65 78 74 2f 68 74 6d 6c 0d 0a 43 61 e: text/html..Ca
0050 63 68 65 2d 43 6f 6e 74 72 6f 6c 3a 20 6d 61 78 che-Control: max
0060 2d 61 67 65 3d 33 36 30 30 2c 20 6d 75 73 74 2d -age=3600, must-
0070 72 65 76 61 6c 69 64 61 74 65 0d 0a 0d 0a revalidate....
The next block is the chunked data:
0000 30 30 35 39 0d 0a 3c 68 74 6d 6c 3e 0a 3c 68 65 0059..<html>.<he
0010 61 64 3e 3c 74 69 74 6c 65 3e 57 65 62 20 53 65 ad><title>Web Se
0020 72 76 65 72 3c 2f 74 69 74 6c 65 3e 0a 3c 2f 68 rver</title>.</h
0030 65 61 64 3e 0a 3c 62 6f 64 79 3e 0a 3c 68 31 3e ead>.<body>.<h1>
0040 48 65 6c 6c 6f 20 57 6f 72 6c 64 3c 2f 68 31 3e Hello World</h1>
0050 0a 3c 2f 62 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0d .</body></html>.
0060 0a 30 0d 0a 0d 0a .0....
The chunk sizes are being displayed on both Chrome and IE.
Can anyone see an issue with my data that would cause this?
Thanks
Solved:
I mistakenly removed the server name, so the browser was taking the Transfer-Encoding header as the Server name and did not recognize the chunked message sizes -- it treated them as just data to display.
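For reference, a correct response keeps Server and Transfer-Encoding as separate header lines, and each chunk is its size in hex, CRLF, the data, CRLF, ending with a zero-length chunk. A minimal sketch (illustrative; the server name and body are placeholders, not the original code):

```python
def chunked_response(body: bytes) -> bytes:
    # Server and Transfer-Encoding must be separate header lines;
    # merging them is what made the browser print the chunk sizes.
    headers = (
        b"HTTP/1.1 200 OK\r\n"
        b"Server: DemoServer\r\n"  # hypothetical server name
        b"Transfer-Encoding: chunked\r\n"
        b"Content-Type: text/html\r\n"
        b"\r\n"
    )
    # Each chunk: size in hex, CRLF, data, CRLF; then the
    # zero-length terminating chunk.
    chunk = b"%x\r\n%s\r\n" % (len(body), body)
    return headers + chunk + b"0\r\n\r\n"

resp = chunked_response(b"<html><body><h1>Hello World</h1></body></html>")
```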

Reformat xattr output and store it in MySQL using a BASH script

I have a script that collects a bunch of file system object information (hashes, dates, etc) and stores it in a MySQL database (one row per object).
The script is running in Bash in Mac OS X 10.10.4 (MBP).
I would like to store the HFS+ extended attributes in the database as well. xattr gives output as shown below. I would like to drop the hex and formatting text, leaving just the attribute name and the ASCII value. That means not only dropping the line numbers, hex, and | formatting characters, but also concatenating each value onto one line per attribute, with the attribute name prepended.
Note that each object (file/folder) may have multiple attributes and the attribute names are not defined.
Take this input:
$ xattr -l wordpress-3.9.6.zip
com.apple.metadata:kMDItemWhereFroms:
00000000 62 70 6C 69 73 74 30 30 A2 01 02 5F 10 29 68 74 |bplist00..._.)ht|
00000010 74 70 73 3A 2F 2F 77 6F 72 64 70 72 65 73 73 2E |tps://wordpress.|
00000020 6F 72 67 2F 77 6F 72 64 70 72 65 73 73 2D 33 2E |org/wordpress-3.|
00000030 39 2E 36 2E 7A 69 70 5F 10 2F 68 74 74 70 73 3A |9.6.zip_./https:|
00000040 2F 2F 77 6F 72 64 70 72 65 73 73 2E 6F 72 67 2F |//wordpress.org/|
00000050 64 6F 77 6E 6C 6F 61 64 2F 72 65 6C 65 61 73 65 |download/release|
00000060 2D 61 72 63 68 69 76 65 2F 08 0B 37 00 00 00 00 |-archive/..7....|
00000070 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 |................|
00000080 00 00 00 00 00 00 00 00 00 00 00 69 |...........i|
0000008c
com.apple.quarantine: 0001;55701556;Google Chrome.app;8AD80928-CB48-48EA-8A1B-EC4B0BE656A9
And make it look like this:
com.apple.metadata:kMDItemWhereFroms: bplist00..._.)https://wordpress.org/wordpress-3.9.6.zip_./https://wordpress.org/download/release-archive/..7...............................i
com.apple.quarantine: 0001;55701556;Google Chrome.app;8AD80928-CB48-48EA-8A1B-EC4B0BE656A9
Thanks for any help
MC
xattr is not very customizable; it's meant more for human browsing than scripted use. You're better off using another language. Here's an example in Python:
import xattr
x = xattr.xattr('wordpress-3.9.6.zip')
for name, value in x.items():
    print(name, repr(value))
You may want to drop the call to repr (or use a different wrapper around value), depending on the desired output.
Note that you almost certainly do not want the dots from the ASCII output of the xattr program, since each one stands in for an arbitrary non-printable byte.
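If the target is nonetheless exactly the one-line format from the question, the rendering itself needs nothing xattr-specific; a stdlib-only sketch (illustrative; the attribute name and value are examples):

```python
def dotted_ascii(value: bytes) -> str:
    """Render bytes the way xattr -l does: printable ASCII
    as-is, anything else as a dot."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in value)

def format_attr(name: str, value: bytes) -> str:
    # One line per attribute, name prepended -- the requested format.
    return "%s: %s" % (name, dotted_ascii(value))

print(format_attr("com.apple.quarantine",
                  b"0001;55701556;Google Chrome.app"))
```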

Printing string representations of xattr hex output

I'm trying to write a script to extract the original download URL from disk images downloaded with Safari on OS X using xattr, so that I can rename them but still easily obtain their original names for reference.
This command prints the hex representation of the URL that the given file was downloaded from, as an example:
xattr -p com.apple.metadata:kMDItemWhereFroms *.dmg
gives
62 70 6C 69 73 74 30 30 A1 01 5F 10 4F 68 74 74
70 3A 2F 2F 61 64 63 64 6F 77 6E 6C 6F 61 64 2E
61 70 70 6C 65 2E 63 6F 6D 2F 4D 61 63 5F 4F 53
5F 58 2F 6D 61 63 5F 6F 73 5F 78 5F 31 30 2E 36
2E 31 5F 62 75 69 6C 64 5F 31 30 62 35 30 34 2F
30 34 31 35 30 37 33 61 2E 64 6D 67 08 0A 00 00
00 00 00 00 01 01 00 00 00 00 00 00 00 02 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 5C
The URL starts at the 14th byte (if I counted correctly) and is NULL terminated. How can I format this string so that I get a string output as follows:
http://adcdownload.apple.com/Mac_OS_X/mac_os_x_10.6.1_build_10b504/0415073a.dmg
(don't worry, this link doesn't work unless you're logged in to ADC)
...essentially, the same thing Finder will display in Get Info. I tried piping xattr's output to xxd but I'm not sure how to specify the offset so the string starts at the right place.
So, after looking at the binary data returned by xattr -p, I realized that it was actually a binary plist... hence "bplist" at the front of the data. For some reason I didn't notice this before, but in light of this, here's a proper solution that should work on every OS X from 10.5 to 10.8.
To avoid duplication, I'll link to the source instead of pasting it: https://github.com/jakepetroules/wherefrom
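Since the value is a binary plist, Python's standard plistlib can also decode it without any offset counting. An illustrative sketch (the hex text would come from xattr -p; the round trip below merely demonstrates the decoding):

```python
import plistlib

def wherefroms_from_hex(hex_text: str):
    """Decode the hex printed by xattr -p for
    com.apple.metadata:kMDItemWhereFroms into a list of URLs."""
    data = bytes.fromhex(hex_text)  # fromhex ignores whitespace
    return plistlib.loads(data)     # understands the bplist00 format

# Round trip with a constructed plist standing in for the real value
urls = ["http://adcdownload.apple.com/Mac_OS_X/"
        "mac_os_x_10.6.1_build_10b504/0415073a.dmg"]
blob = plistlib.dumps(urls, fmt=plistlib.FMT_BINARY)
print(wherefroms_from_hex(blob.hex()))
```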
