How to build a deep recursive parent-child relationship in Java (Spring Boot)

The data set looks like the following query result, which is the output of a simple TABLE dndclasses; SQL command:
 id │ parent_id │ name
════╪═══════════╪═══════════════════════════
  1 │         ¤ │ Warrior
  2 │         ¤ │ Wizard
  3 │         ¤ │ Priest
  4 │         ¤ │ Rogue
  5 │         1 │ Fighter
  6 │         1 │ Paladin
  7 │         1 │ Ranger
  8 │         2 │ Mage
  9 │         2 │ Specialist wizard
 10 │         3 │ Cleric
 11 │         3 │ Druid
 12 │         3 │ Priest of specific mythos
 13 │         4 │ Thief
 14 │         4 │ Bard
 15 │        13 │ Assassin
(15 rows)
Output: the result we need should look something like this:
[
  {
    "Name": "Wizard",
    "Sub Classes": ["Mage", "Specialist wizard"]
  },
  {
    "Name": "Priest",
    "Sub Classes": ["Cleric", "Druid", "Priest of specific mythos"]
  },
  {
    "Name": "Warrior",
    "Sub Classes": ["Fighter", "Paladin", "Ranger"]
  },
  {
    "Name": "Rogue",
    "Sub Classes": ["Thief", "Bard"]
  },
  {
    "Name": "Rogue",
    "Sub Classes": {
      "Name": "Thief", "Sub Classes": ["Assassin"]
    }
  }
]
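One way to build that nesting in plain Java is to load the flat rows and group children under their parents recursively. A minimal sketch follows; the Row and Node records and their field names are illustrative, not tied to any particular JPA entity or repository:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DndClassTree {

    // Flat row as returned by the query above; a null parentId marks a root class.
    record Row(int id, Integer parentId, String name) {}

    // Nested node matching the desired output: a name plus its sub classes.
    record Node(String name, List<Node> subClasses) {}

    static List<Node> buildTree(List<Row> rows) {
        // Index child rows by their parent id, then start from the roots.
        Map<Integer, List<Row>> byParent = rows.stream()
                .filter(r -> r.parentId() != null)
                .collect(Collectors.groupingBy(Row::parentId));
        return rows.stream()
                .filter(r -> r.parentId() == null)
                .map(r -> toNode(r, byParent))
                .collect(Collectors.toList());
    }

    // Recursion is what lets Thief -> Assassin (or any deeper chain) nest correctly.
    static Node toNode(Row row, Map<Integer, List<Row>> byParent) {
        List<Node> children = byParent.getOrDefault(row.id(), List.of()).stream()
                .map(child -> toNode(child, byParent))
                .collect(Collectors.toList());
        return new Node(row.name(), children);
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(1, null, "Warrior"), new Row(2, null, "Wizard"),
                new Row(3, null, "Priest"), new Row(4, null, "Rogue"),
                new Row(5, 1, "Fighter"), new Row(6, 1, "Paladin"),
                new Row(7, 1, "Ranger"), new Row(8, 2, "Mage"),
                new Row(9, 2, "Specialist wizard"), new Row(10, 3, "Cleric"),
                new Row(11, 3, "Druid"), new Row(12, 3, "Priest of specific mythos"),
                new Row(13, 4, "Thief"), new Row(14, 4, "Bard"),
                new Row(15, 13, "Assassin"));
        // A JSON mapper such as Jackson could serialize these nodes into the shape above;
        // here we simply print the structure.
        buildTree(rows).forEach(System.out::println);
    }
}

In a Spring Boot service the Row list would come from a repository query instead of List.of; the grouping and recursion stay the same.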

Related

How to convert a newman response string to a JSON response in bash [closed]

I am looking to convert a table-formatted string, which is a newman response, to JSON format. Below is the string:
┌─────────────────────────┬────────────────────┬───────────────────┐
│ │ executed │ failed │
├─────────────────────────┼────────────────────┼───────────────────┤
│ iterations │ 1 │ 0 │
├─────────────────────────┼────────────────────┼───────────────────┤
│ requests │ 30 │ 0 │
├─────────────────────────┼────────────────────┼───────────────────┤
│ test-scripts │ 60 │ 0 │
├─────────────────────────┼────────────────────┼───────────────────┤
│ prerequest-scripts │ 38 │ 0 │
├─────────────────────────┼────────────────────┼───────────────────┤
│ assertions │ 56 │ 0 │
├─────────────────────────┴────────────────────┴───────────────────┤
│ total run duration: 8.9s │
├──────────────────────────────────────────────────────────────────┤
│ total data received: 8.19kB (approx) │
├──────────────────────────────────────────────────────────────────┤
│ average response time: 267ms │
└──────────────────────────────────────────────────────────────────┘
I am trying to convert the above to a JSON response as below:
{
  "iterations": {
    "executed": 1,
    "failed": 0
  },
  "requests": {
    "executed": 30,
    "failed": 0
  },
  "test-scripts": {
    "executed": 60,
    "failed": 0
  },
  "prerequest-scripts": {
    "executed": 38,
    "failed": 0
  },
  "assertions": {
    "executed": 56,
    "failed": 0
  },
  "total run duration": "8.9s",
  "total data received": "8.19kB (approx)",
  "average response time": "267ms"
}
Can someone help me with this, please?
Given that exact string stored in the shell variable $s, you can use this Ruby one-liner:
echo "$s" | ruby -r json -e 's=$<.read
d={}
s.scan(/^\W*([a-zA-Z\-]+)\W*(\d+)\W*(\d+)/).
map{|k,e,f| {"#{k}":{"executed":e, "failed":f}} }.
each{|h| d.merge!(h) }
s.scan(/^(?=.*:)[ \t]*│[ \t]*([^:]*):[ \t]*(.*?)(?:[ \t]{2,}|│)/).
map{|k,v| {"#{k}":v}}.
each{|h| d.merge!(h) }
puts JSON.pretty_generate(d)'
Prints:
{
  "iterations": {
    "executed": "1",
    "failed": "0"
  },
  "requests": {
    "executed": "30",
    "failed": "0"
  },
  "test-scripts": {
    "executed": "60",
    "failed": "0"
  },
  "prerequest-scripts": {
    "executed": "38",
    "failed": "0"
  },
  "assertions": {
    "executed": "56",
    "failed": "0"
  },
  "total run duration": "8.9s",
  "total data received": "8.19kB (approx)",
  "average response time": "267ms"
}
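If a shell pipeline is not an option, the same two-pass idea (one pattern for the executed/failed rows, one for the key: value summary rows) can be sketched in Java; the class name and regexes below are illustrative, not part of newman itself:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NewmanTableParser {

    public static Map<String, Object> parse(String table) {
        Map<String, Object> result = new LinkedHashMap<>();

        // Pass 1: rows with a label and two numeric cells (executed / failed).
        Matcher counts = Pattern
                .compile("^\\W*([a-zA-Z-]+)\\W*(\\d+)\\W*(\\d+)", Pattern.MULTILINE)
                .matcher(table);
        while (counts.find()) {
            Map<String, Object> cell = new LinkedHashMap<>();
            cell.put("executed", Integer.parseInt(counts.group(2)));
            cell.put("failed", Integer.parseInt(counts.group(3)));
            result.put(counts.group(1), cell);
        }

        // Pass 2: "key: value" summary rows such as "total run duration: 8.9s".
        Matcher summary = Pattern
                .compile("│\\s*([^:│]+):\\s*(.*?)\\s*│")
                .matcher(table);
        while (summary.find()) {
            result.put(summary.group(1).trim(), summary.group(2).trim());
        }

        return result;
    }

    public static void main(String[] args) {
        String table = String.join("\n",
                "│ iterations │ 1 │ 0 │",
                "│ total run duration: 8.9s │");
        // A JSON library such as Jackson could serialize this map; here we just print it.
        System.out.println(parse(table));
    }
}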

Spring Scheduler won't stop after loop but runs for 1 minute before stopping and re-triggering

@Slf4j
@Component
public class ABScheduler {

    @Autowired
    private ABService abService;

    @Scheduled(cron = "* */5 * * * *")
    private void testCron() {
        String hour = "01";
        List<Object1> object1List = new ArrayList<>(abService.getListOfobject(hour));
        for (Object1 object1 : object1List) {
            System.out.println("object " + object1);
        }
        log.info("Cron finish run at " + new Timestamp(System.currentTimeMillis()));
    }
}
I'm new to Spring's scheduler and I'm having trouble making it stop after the for loop.
After getting the list of objects in object1List (e.g. 3 objects), it enters the loop 3 times (correct). But instead of stopping the cron after looping 3 times, it keeps looping again for a duration of 1 minute.
What do I need to do to make sure the cron runs every 5 minutes but stops after finishing its task, rather than repeating the same task for 1 minute?
You should add the @EnableScheduling annotation to the project.
Your cron runs every second during every fifth minute (your cron's second field is *).
If you only want it to run once every five minutes, try this: @Scheduled(cron="0 */5 * * * *")
The cron format:
┌───────────── second (0-59)
│ ┌───────────── minute (0 - 59)
│ │ ┌───────────── hour (0 - 23)
│ │ │ ┌───────────── day of the month (1 - 31)
│ │ │ │ ┌───────────── month (1 - 12) (or JAN-DEC)
│ │ │ │ │ ┌───────────── day of the week (0 - 7)
│ │ │ │ │ │ (0 or 7 is Sunday, or MON-SUN)
│ │ │ │ │ │
* * * * * *
Resource: Spring Framework CronExpression
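Putting both points together, a minimal sketch of the corrected scheduler might look like this (ABService and Object1 are the question's own types and are assumed to exist):

import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import lombok.extern.slf4j.Slf4j;

@Slf4j
@Component
public class ABScheduler {

    @Autowired
    private ABService abService;   // ABService and Object1 come from the question

    // Second field is 0, so the job fires once at the start of every 5th minute
    // instead of once per second during every 5th minute.
    @Scheduled(cron = "0 */5 * * * *")
    public void testCron() {
        List<Object1> object1List = new ArrayList<>(abService.getListOfobject("01"));
        for (Object1 object1 : object1List) {
            System.out.println("object " + object1);
        }
        log.info("Cron finished at " + new Timestamp(System.currentTimeMillis()));
    }
}

// In a separate file: scheduling must be switched on once for the application.
@EnableScheduling
@SpringBootApplication
class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}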

Why does mergeMap in RxJS remember history when it's not a ReplaySubject

Two buttons on screen with clicks plumbed in like this.
b1c: number;
b1s: Subject<number>;
o1: Observable<number>;

b2c: number;
b2s: Subject<number>;
o2: Observable<number>;

rxjs(): void {
  this.b1c = 0;
  this.b1s = new Subject<number>();
  this.b2c = 0;
  this.b2s = new Subject<number>();

  this.b1s.pipe(
    tap(z => console.log('do', z)),
    mergeMap(z1 => this.b2s.pipe(
      map(z2 => `${z1} / ${z2}`)
    ))
  ).subscribe(z => console.log(z));

  // When you click button 1 nothing happens.
  // When you click button 2 you get output i/j for all values of i from 0 to current, and the current value of j.
  // This is incorrect because this.b1s is not a ReplaySubject and therefore should not remember the previous values.
}

buttonClick(buttonNumber: number) {
  if (buttonNumber === 1) {
    this.b1s.next(this.b1c);
    this.b1c++;
  } else {
    this.b2s.next(this.b2c);
    this.b2c++;
  }
}
Can you explain this behaviour? (Button clicks: 1, 1, 1, 1, 2, 1, 1, 1, 2.)
It should be only
5 / 1
6 / 1
7 / 1
because 0 to 4 have already been consumed; there is no reason why they should be remembered.
In fact maybe there should be no output at all because at no time do the two observables simultaneously fire -- you can't click two buttons at once.
How do you explain this behaviour, and how is it possible to deduce this from the documentation?
Furthermore: I don't understand why there are three 30s and three 50s and why they're mixed up. There should be six outputs because there are six events.
https://rxjs-dev.firebaseapp.com/api/operators/flatMap
And we still have this business about it remembering the last value from the other source.
And what on earth is
10*i--10*i--10*i-|
supposed to mean?
What you understand as 'remembering history' are just active subscriptions.
obs1.pipe(mergeMap(v => obs2)) maps each value from the source (obs1) to a stream (obs2). So you end up with multiple obs2 streams having an active subscription at the same time. These obs2 streams are merged in one output stream and run until they complete, error or you unsubscribe.
In your case obs2 = b2s is a Subject, which is a hot observable that doesn't terminate on its own. The final observable behaves like this:
this.b1s.pipe(
  mergeMap(z1 => this.b2s.pipe(
    map(z2 => `${z1} / ${z2}`)
  ))
).subscribe(z => console.log(z));
b1s: --0--1--2--3--4-------5--6--7--------
b2s-7: │ │ │ │ │ │ │ └--7/1---
b2s-6: │ │ │ │ │ │ └-----6/1---
b2s-5: │ │ │ │ │ └--------5/1---
b2s-4: │ │ │ │ └--4/0-----------4/1---
b2s-3: │ │ │ └-----3/0-----------3/1---
b2s-2: │ │ └--------2/0-----------2/1---
b2s-1: │ └-----------1/0-----------1/1---
b2s-0: └--------------0/0-----------0/1---
output: -----------------0/0-----------0/1---
1/0 1/1
2/0 2/1
3/0 3/1
4/0 4/1
5/1
6/1
7/1
The subscriptions to b2s-0 - b2s-4 are still active when b2s emits 1.
If you don't want the inner streams to run indefinitely you have to terminate them somehow. You could use take(1) if you only want them to emit one value.
this.b1s.pipe(
  mergeMap(z1 => this.b2s.pipe(
    take(1), // <-- only emit 1 value and then complete
    map(z2 => `${z1} / ${z2}`)
  ))
).subscribe(z => console.log(z));
b1s: --0--1--2--3--4-------5--6--7--------
b2s-7: │ │ │ │ │ │ │ └--7/1|
b2s-6: │ │ │ │ │ │ └-----6/1|
b2s-5: │ │ │ │ │ └--------5/1|
b2s-4: │ │ │ │ └--4/0|
b2s-3: │ │ │ └-----3/0|
b2s-2: │ │ └--------2/0|
b2s-1: │ └-----------1/0|
b2s-0: └--------------0/0|
output: -----------------0/0-----------5/1---
1/0 6/1
2/0 7/1
3/0
4/0
In the documentation obs2 = 10----10----10-| is a cold observable that terminates after 3 emissions. It also generates the same 3 emissions for every subscriber (unlike your Subject).
ob1: --1-----------------------------3--------------5----------------------------|
obs2-3: │ │ └10*5------10*5------10*5-|
obs2-2: │ └10*3------10*3------10*3-|
obs2-1: └10*1------10*1------10*1-|
output: ---10*1------10*1------10*1------10*3------10*3-10*5-10*3-10*5------10*5----|
= --- 10 ------ 10 ------ 10 ------ 30 ------ 30 - 50 - 30 - 50 ------ 50 ----|

How to get value of a FLAG in ClickHouse

ClickHouse has flags for some options. For example, allow_experimental_live_view can be set to 1 with:
:) SET allow_experimental_live_view = 1
But there is no GET to get the value of the flag, something like GET allow_experimental_live_view.
Is there any way to get the value of a flag?
Use the system.settings table to get the current status:
SELECT *
FROM system.settings
WHERE name = 'allow_experimental_live_view'
/*
┌─name─────────────────────────┬─value─┬─changed─┬─description──────────────────────────┬─min──┬─max──┬─readonly─┐
│ allow_experimental_live_view │ 0 │ 1 │ Enable LIVE VIEW. Not mature enough. │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 0 │
└──────────────────────────────┴───────┴─────────┴──────────────────────────────────────┴──────┴──────┴──────────┘
*/
With newer versions, there is also the getSetting function for this.
SELECT getSetting('allow_experimental_live_view')
/*
┌─getSetting('allow_experimental_live_view')─┐
│ 0 │
└────────────────────────────────────────────┘
*/
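If you need the value from application code rather than the interactive client, the same system.settings query can be run over JDBC. A minimal sketch, assuming the ClickHouse JDBC driver is on the classpath and the server is reachable on the default HTTP port 8123 with default credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ReadClickHouseSetting {
    public static void main(String[] args) throws Exception {
        // Assumed connection URL; adjust host, port, database and credentials for your setup.
        String url = "jdbc:clickhouse://localhost:8123/default";
        try (Connection conn = DriverManager.getConnection(url, "default", "")) {
            // Same query as above: read the current value from system.settings.
            String sql = "SELECT value, changed FROM system.settings WHERE name = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "allow_experimental_live_view");
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("value=" + rs.getString("value")
                                + " changed=" + rs.getInt("changed"));
                    }
                }
            }
        }
    }
}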

Inspect Parquet from command line

How do I inspect the content of a Parquet file from the command line?
The only option I see now is
$ hadoop fs -get my-path local-file
$ parquet-tools head local-file | less
I would like to
avoid creating the local-file and
view the file content as json rather than the typeless text that parquet-tools prints.
Is there an easy way?
You can use parquet-tools with the command cat and the --json option in order to view the files without a local copy and in the JSON format.
Here is an example:
parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet
This prints out the data in JSON format:
{"name":"gil","age":48,"city":"london"}
{"name":"jane","age":30,"city":"new york"}
{"name":"jordan","age":18,"city":"toronto"}
Disclaimer: this was tested in Cloudera CDH 5.12.0
Install homebrew on your Mac (see https://brew.sh/) and then just:
brew install parquet-tools
Once you have done that you can use the parquet-tools binary (which should now be in your path) at your command line for various commands.
parquet-tools or parquet-tools -h will give you usage info.
Examples:
> parquet-tools rowcount part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
Total RowCount: 148192
> parquet-tools head -n 1 part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
:created_at = 2019-02-28T00:16:06.329Z
:id = row-wive~i58u-qaeu
:updated_at = 2019-02-28T00:16:06.329Z
agency = 1
body_style = PA
color = GY
fine_amount = 63
issue_date = 17932
issue_time = 1950
latitude = 64379050
location = 12743 DAVENTRY
longitude = 19261609
make = HYDA
marked_time =
meter_id =
plate_expiry_date = 18048
route = 16X2
rp_state_plate = CA
ticket_number = 1020798376
vin =
violation_code = 22502A#
violation_description = 18 IN. CURB/2 WAY
> parquet-tools meta part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
file: file:/Users/matthewropp/team_demo/los-angeles-parking-citations/raw_citations/issue_month=201902/part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":":created_at","type":"string","nullable":true,"metadata":{}},{"name":":id","type":"string","nullable":true,"metadata":{}},{"name":":updated_at","type":"string","nullable":true,"metadata":{}},{"name":"agency","type":"integer","nullable":true,"metadata":{}},{"name":"body_style","type":"string","nullable":true,"metadata":{}},{"name":"color","type":"string","nullable":true,"metadata":{}},{"name":"fine_amount","type":"integer","nullable":true,"metadata":{}},{"name":"issue_date","type":"date","nullable":true,"metadata":{}},{"name":"issue_time","type":"integer","nullable":true,"metadata":{}},{"name":"latitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"location","type":"string","nullable":true,"metadata":{}},{"name":"longitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"make","type":"string","nullable":true,"metadata":{}},{"name":"marked_time","type":"string","nullable":true,"metadata":{}},{"name":"meter_id","type":"string","nullable":true,"metadata":{}},{"name":"plate_expiry_date","type":"date","nullable":true,"metadata":{}},{"name":"route","type":"string","nullable":true,"metadata":{}},{"name":"rp_state_plate","type":"string","nullable":true,"metadata":{}},{"name":"ticket_number","type":"string","nullable":false,"metadata":{}},{"name":"vin","type":"string","nullable":true,"metadata":{}},{"name":"violation_code","type":"string","nullable":true,"metadata":{}},{"name":"violation_description","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
: created_at: OPTIONAL BINARY O:UTF8 R:0 D:1
: id: OPTIONAL BINARY O:UTF8 R:0 D:1
: updated_at: OPTIONAL BINARY O:UTF8 R:0 D:1
agency: OPTIONAL INT32 R:0 D:1
body_style: OPTIONAL BINARY O:UTF8 R:0 D:1
color: OPTIONAL BINARY O:UTF8 R:0 D:1
fine_amount: OPTIONAL INT32 R:0 D:1
issue_date: OPTIONAL INT32 O:DATE R:0 D:1
issue_time: OPTIONAL INT32 R:0 D:1
latitude: OPTIONAL INT32 O:DECIMAL R:0 D:1
location: OPTIONAL BINARY O:UTF8 R:0 D:1
longitude: OPTIONAL INT32 O:DECIMAL R:0 D:1
make: OPTIONAL BINARY O:UTF8 R:0 D:1
marked_time: OPTIONAL BINARY O:UTF8 R:0 D:1
meter_id: OPTIONAL BINARY O:UTF8 R:0 D:1
plate_expiry_date: OPTIONAL INT32 O:DATE R:0 D:1
route: OPTIONAL BINARY O:UTF8 R:0 D:1
rp_state_plate: OPTIONAL BINARY O:UTF8 R:0 D:1
ticket_number: REQUIRED BINARY O:UTF8 R:0 D:0
vin: OPTIONAL BINARY O:UTF8 R:0 D:1
violation_code: OPTIONAL BINARY O:UTF8 R:0 D:1
violation_description: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:148192 TS:10503944 OFFSET:4
--------------------------------------------------------------------------------
: created_at: BINARY SNAPPY DO:0 FPO:4 SZ:607/616/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
: id: BINARY SNAPPY DO:0 FPO:611 SZ:2365472/3260525/1.38 VC:148192 ENC:BIT_PACKED,PLAIN,RLE ST:[min: row-2229_y75z.ftdu, max: row-zzzs_4hta.8fub, num_nulls: 0]
: updated_at: BINARY SNAPPY DO:0 FPO:2366083 SZ:602/611/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
agency: INT32 SNAPPY DO:0 FPO:2366685 SZ:4871/5267/1.08 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 58, num_nulls: 0]
body_style: BINARY SNAPPY DO:0 FPO:2371556 SZ:36244/61827/1.71 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WR, num_nulls: 0]
color: BINARY SNAPPY DO:0 FPO:2407800 SZ:111267/111708/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YL, num_nulls: 0]
fine_amount: INT32 SNAPPY DO:0 FPO:2519067 SZ:71989/82138/1.14 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 25, max: 363, num_nulls: 63]
issue_date: INT32 SNAPPY DO:0 FPO:2591056 SZ:20872/23185/1.11 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-01, max: 2019-02-27, num_nulls: 0]
issue_time: INT32 SNAPPY DO:0 FPO:2611928 SZ:210026/210013/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 2359, num_nulls: 41]
latitude: INT32 SNAPPY DO:0 FPO:2821954 SZ:508049/512228/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 6513161.2, num_nulls: 0]
location: BINARY SNAPPY DO:0 FPO:3330003 SZ:1251364/2693435/2.15 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,PLAIN,RLE ST:[min: , max: ZOMBAR/VALERIO, num_nulls: 0]
longitude: INT32 SNAPPY DO:0 FPO:4581367 SZ:516233/520692/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 1941557.4, num_nulls: 0]
make: BINARY SNAPPY DO:0 FPO:5097600 SZ:147034/150364/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YAMA, num_nulls: 0]
marked_time: BINARY SNAPPY DO:0 FPO:5244634 SZ:11675/17658/1.51 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: 959.0, num_nulls: 0]
meter_id: BINARY SNAPPY DO:0 FPO:5256309 SZ:172432/256692/1.49 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YO97, num_nulls: 0]
plate_expiry_date: INT32 SNAPPY DO:0 FPO:5428741 SZ:149849/152288/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2000-02-01, max: 2099-12-01, num_nulls: 18624]
route: BINARY SNAPPY DO:0 FPO:5578590 SZ:38377/45948/1.20 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WTD, num_nulls: 0]
rp_state_plate: BINARY SNAPPY DO:0 FPO:5616967 SZ:33281/60186/1.81 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: AB, max: XX, num_nulls: 0]
ticket_number: BINARY SNAPPY DO:0 FPO:5650248 SZ:801039/2074791/2.59 VC:148192 ENC:BIT_PACKED,PLAIN ST:[min: 1020798376, max: 4350802142, num_nulls: 0]
vin: BINARY SNAPPY DO:0 FPO:6451287 SZ:64/60/0.94 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: , num_nulls: 0]
violation_code: BINARY SNAPPY DO:0 FPO:6451351 SZ:94784/131071/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 000, max: 8942, num_nulls: 0]
violation_description: BINARY SNAPPY DO:0 FPO:6546135 SZ:95937/132641/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YELLOW ZONE, num_nulls: 0]
> parquet-tools dump -m -c make part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet | head -20
BINARY make
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 148192 ***
value 1: R:0 D:1 V:HYDA
value 2: R:0 D:1 V:NISS
value 3: R:0 D:1 V:NISS
value 4: R:0 D:1 V:TOYO
value 5: R:0 D:1 V:AUDI
value 6: R:0 D:1 V:MERC
value 7: R:0 D:1 V:LEX
value 8: R:0 D:1 V:BMW
value 9: R:0 D:1 V:GMC
value 10: R:0 D:1 V:HOND
value 11: R:0 D:1 V:TOYO
value 12: R:0 D:1 V:NISS
value 13: R:0 D:1 V:
value 14: R:0 D:1 V:THOR
value 15: R:0 D:1 V:DODG
value 16: R:0 D:1 V:DODG
value 17: R:0 D:1 V:HOND
I've found this program really useful:
https://github.com/chhantyal/parquet-cli
Lets you view parquet files without having the whole infrastructure installed.
Just type:
pip install parquet-cli
parq input.parquet --head 10
I recommend just building and running the parquet-tools.jar for your Hadoop distribution.
Check out the GitHub project: https://github.com/apache/parquet-mr/tree/master/parquet-tools
hadoop jar ./parquet-tools-<VERSION>.jar <command>.
By default parquet-tools looks in the local file system, so to point it at HDFS you need to add hdfs:// to the beginning of the file path. So in your case, you can do something like this:
parquet-tools head hdfs://localhost/<hdfs-path> | less
I had the same issue and it worked fine for me. There is no need to download the file locally first.
DuckDB has a CLI tool (prebuilt binaries for Linux, Windows, and macOS) that can be used to query Parquet data from the command line.
PS C:\Users\nsuser\dev\standalone_executable_binaries> ./duckdb
v0.5.1 7c111322d
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
Read parquet data using SQL queries
D SELECT * FROM READ_PARQUET('C:\Users\nsuser\dev\sample_files\userdata1.parquet') limit 3;
┌─────────────────────┬────┬────────────┬───────────┬─────────────────────────┬────────┬────────────────┬──────────────────┬───────────┬───────────┬───────────┬─────────────────────┬──────────┐
│ registration_dttm │ id │ first_name │ last_name │ email │ gender │ ip_address │ cc │ country │ birthdate │ salary │ title │ comments │
├─────────────────────┼────┼────────────┼───────────┼─────────────────────────┼────────┼────────────────┼──────────────────┼───────────┼───────────┼───────────┼─────────────────────┼──────────┤
│ 2016-02-03 07:55:29 │ 1 │ Amanda │ Jordan │ ajordan0#com.com │ Female │ 1.197.201.2 │ 6759521864920116 │ Indonesia │ 3/8/1971 │ 49756.53 │ Internal Auditor │ 1E+02 │
│ 2016-02-03 17:04:03 │ 2 │ Albert │ Freeman │ afreeman1#is.gd │ Male │ 218.111.175.34 │ │ Canada │ 1/16/1968 │ 150280.17 │ Accountant IV │ │
│ 2016-02-03 01:09:31 │ 3 │ Evelyn │ Morgan │ emorgan2#altervista.org │ Female │ 7.161.136.94 │ 6767119071901597 │ Russia │ 2/1/1960 │ 144972.51 │ Structural Engineer │ │
└─────────────────────┴────┴────────────┴───────────┴─────────────────────────┴────────┴────────────────┴──────────────────┴───────────┴───────────┴───────────┴─────────────────────┴──────────┘
Read Parquet Schema.
D DESCRIBE SELECT * FROM READ_PARQUET('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
OR
D SELECT * FROM PARQUET_SCHEMA('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
┌───────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│ column_name │ column_type │ null │ key │ default │ extra │
├───────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ registration_dttm │ TIMESTAMP │ YES │ │ │ │
│ id │ INTEGER │ YES │ │ │ │
│ first_name │ VARCHAR │ YES │ │ │ │
│ birthdate │ VARCHAR │ YES │ │ │ │
│ salary │ DOUBLE │ YES │ │ │ │
└───────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
Read Parquet metadata and statistics.
D SELECT row_group_id, row_group_num_rows, compression, stats_min, stats_max, stats_null_count FROM PARQUET_METADATA('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
┌──────────────┬────────────────────┬──────────────┬─────────────────────┬─────────────────────┬──────────────────┐
│ row_group_id │ row_group_num_rows │ compression │ stats_min │ stats_max │ stats_null_count │
├──────────────┼────────────────────┼──────────────┼─────────────────────┼─────────────────────┼──────────────────┤
│ 0 │ 1000 │ UNCOMPRESSED │ 2016-02-03 22:59:12 │ 2016-02-03 20:51:31 │ 0 │
│ 0 │ 1000 │ UNCOMPRESSED │ 1 │ 1000 │ 0 │
│ 0 │ 1000 │ UNCOMPRESSED │ "Bonaire │ Zimbabwe │ 0 │
│ 0 │ 1000 │ UNCOMPRESSED │ │ 9/9/1981 │ 0 │
│ 0 │ 1000 │ UNCOMPRESSED │ 12380.49 │ 286592.99 │ 68 │
└──────────────┴────────────────────┴──────────────┴─────────────────────┴─────────────────────┴──────────────────┘
Alternative:
parquet-cli is a lightweight Python alternative.
pip install parquet-cli //installs via pip
parq filename.parquet //view meta data
parq filename.parquet --schema //view the schema
parq filename.parquet --head 10 //view top n rows
If you're using HDFS, the following commands are very useful as they are frequently used (left here for future reference):
hadoop jar parquet-tools-1.9.0.jar schema hdfs://path/to/file.snappy.parquet
hadoop jar parquet-tools-1.9.0.jar head -n5 hdfs://path/to/file.snappy.parquet
Actually, I found out that pandas already supports Parquet files, as long as you've installed pyarrow or fastparquet as its backend. Check out read_parquet:
import pandas as pd
df = pd.read_parquet('your-file.parquet')
df.head(10)
...
Previous answer:
Might be late to the party, but I just learnt that pyarrow already supports reading Parquet, and it's quite powerful. Chances are that you already have pyarrow and pandas installed, so you can read Parquet just like this:
from pyarrow import parquet
import pandas
p = parquet.read_table('/path/to/your/xxxxx.parquet')
df = p.to_pandas()
df.head(10)
...
If you use Docker you can also do something like this:
docker run -ti -v C:\file.parquet:/tmp/file.parquet nathanhowell/parquet-tools cat /tmp/file.parquet
On Windows 10 x64, try Parq:
choco install parq
This installs everything into the current directory. You will have to add this directory manually to the path, or run parq.exe from within this directory.
My other answer builds parquet-reader from source. This utility looks like it does much the same job.
I'd rather use hdfs NFS Gateway + autofs for easy hdfs file investigation.
My setup:
HDFS NFS Gateway service running on the namenode.
Distribution-bundled autofs service enabled, with the following configuration change made to auto.master:
/net -hosts nobind
I can easily run the following commands to investigate any HDFS file:
head /net/<namenodeIP>/path/to/hdfs/file
parquet-tools head /net/<namenodeIP>/path/to/hdfs/par-file
rsync -rv /local/directory/ /net/<namenodeIP>/path/to/hdfs/parentdir/
Forget about the hadoop* and hdfs* commands ;)
On Windows 10 x64 I ended up building parquet-reader just now from source:
Windows 10 + WSL + GCC
Installed WSL with Ubuntu 18.04 LTS. Upgraded gcc to v9.2.1 and CMake to latest. Bonus: install Windows Terminal.
git clone https://github.com/apache/arrow
cd arrow
cd cpp
mkdir buildgcc
cd buildgcc
cmake .. -DPARQUET_BUILD_EXECUTABLES=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_BROTLI=ON -DPARQUET_BUILD_EXAMPLES=ON -DARROW_CSV=ON
make -j 20
cd release
./parquet-reader
Usage: parquet-reader [--only-metadata] [--no-memory-map] [--json] [--dump] [--print-key-value-metadata] [--columns=...] <file>
If it has trouble building, you may have to use vcpkg for the missing libraries.
Also see another solution that offers less, but in a simpler way: https://github.com/chhantyal/parquet-cli
Linked from: How can I write streaming/row-oriented data using parquet-cpp without buffering?
I initially tried brew install parquet-tools, but this did not appear to work under my install of WSL.
Windows 10 + MSVC
Same as above. Use CMake to generate the Visual Studio 2019 project, then build.
git clone https://github.com/apache/arrow
cd arrow
cd cpp
mkdir buildmsvc
cd buildmsvc
cmake .. -DPARQUET_BUILD_EXECUTABLES=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_BROTLI=ON -DPARQUET_BUILD_EXAMPLES=ON -DARROW_CSV=ON
# Then open the generated .sln file in MSVC and build. Everything should build perfectly.
Troubleshooting:
In case there were any missing libraries, I pointed it at my install of vcpkg. I ran vcpkg integrate install, then copied the following to the end of the CMake line:
-DCMAKE_TOOLCHAIN_FILE=[...path...]/vcpkg/scripts/buildsystems
If it had complained about any missing libraries, I would have installed them, e.g. boost, using commands like vcpkg install boost:x64.
In case anyone else comes to this looking for an easy way to inspect a parquet file from the command line, I wrote the tool clidb to do this.
It doesn’t generate json like the OP wanted but instead shows the parquet data as a table and allows SQL snippets to be run against it. It should work with:
pip install "clidb[extras]"
clidb path/with/data
