Can someone explain the output of orcfiledump to me? - hadoop

My table test_orc contains (for one partition):
col1 col2 part1
abc def 1
ghi jkl 1
mno pqr 1
koi hai 1
jo pgl 1
hai tre 1
By running
hive --orcfiledump /hive/user.db/test_orc/part1=1/000000_0
I get the following:
Structure for /hive/a0m01lf.db/test_orc/part1=1/000000_0 .
2018-02-18 22:10:24 INFO: org.apache.hadoop.hive.ql.io.orc.ReaderImpl - Reading ORC rows from /hive/a0m01lf.db/test_orc/part1=1/000000_0 with {include: null, offset: 0, length: 9223372036854775807} .
Rows: 6 .
Compression: ZLIB .
Compression size: 262144 .
Type: struct<_col0:string,_col1:string> .
Stripe Statistics:
Stripe 1:
Column 0: count: 6 .
Column 1: count: 6 min: abc max: mno sum: 17 .
Column 2: count: 6 min: def max: tre sum: 18 .
File Statistics:
Column 0: count: 6 .
Column 1: count: 6 min: abc max: mno sum: 17 .
Column 2: count: 6 min: def max: tre sum: 18 .
Stripes:
Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67 .
Stream: column 0 section ROW_INDEX start: 3 length 9 .
Stream: column 1 section ROW_INDEX start: 12 length 29 .
Stream: column 2 section ROW_INDEX start: 41 length 29 .
Stream: column 1 section DATA start: 70 length 20 .
Stream: column 1 section LENGTH start: 90 length 12 .
Stream: column 2 section DATA start: 102 length 21 .
Stream: column 2 section LENGTH start: 123 length 5 .
Encoding column 0: DIRECT .
Encoding column 1: DIRECT_V2 .
Encoding column 2: DIRECT_V2 .
What does the part about stripes mean?

First, let's look at how an ORC file is laid out.
Here are some keywords from that layout that also appear in your question:
Stripe - A chunk of data stored in an ORC file. Every ORC file is divided into these chunks, called stripes (250 MB by default), each holding index data, the actual row data, and a stripe footer with metadata about that data.
Compression - The compression codec used to compress the data stored. ZLIB is the default for ORC.
Index Data - includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.
Row data - The actual data; it is used in table scans.
Stripe Footer - The stripe footer contains the encoding of each column and the directory of the streams including their location. To describe each stream, ORC stores the kind of stream, the column id, and the stream’s size in bytes. The details of what is stored in each stream depends on the type and encoding of the column.
Postscript - holds compression parameters and the size of the compressed footer.
File Footer - The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates count, min, max, and sum.
Now, about your output from orcfiledump:
1. First comes general information about your file: the name, location, compression codec, compression size, and so on.
2. Stripe statistics lists every stripe in your ORC file and its corresponding information: row counts and, per column, statistics such as min, max, and sum (for string columns, min/max are lexicographic and sum is the total number of characters).
3. File statistics is the same as #2, but aggregated for the complete file rather than per stripe.
4. The last part, the Stripes section, describes each stripe in detail: for every column, the streams it contains (ROW_INDEX, DATA, LENGTH), their offsets and lengths within the file, and the encoding used for that column.
Also, you can pass various options to orcfiledump to get the output you want. Here is a handy guide:
// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump]
[--backup-path <new-path>] <location-of-orc-file-or-directory>
Here is a quick guide to the options used in the commands above:
Specifying -d in the command will cause it to dump the ORC file data rather than the metadata (Hive 1.1.0 and later).
Specifying --rowindex with a comma-separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top-level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).
Specifying -t in the command will print the timezone id of the writer.
Specifying -j in the command will print the ORC file metadata in JSON format. To pretty-print the JSON metadata, add -p to the command.
Specifying --recover in the command will recover a corrupted ORC file generated by Hive streaming.
Specifying --skip-dump along with --recover will perform recovery without dumping metadata.
Specifying --backup-path with a new-path will let the recovery tool move corrupted files to the specified backup path (default: /tmp).
<location-of-orc-file> is the URI of the ORC file.
<location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
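For example, assuming Hive 1.1.0 or later, you could print the row indexes for the two string columns of the file in your question, or dump its data instead of the metadata (these are just the flags described above applied to the path from your question):
hive --orcfiledump --rowindex 1,2 /hive/user.db/test_orc/part1=1/000000_0
hive --orcfiledump -d /hive/user.db/test_orc/part1=1/000000_0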
Hope that helps!

Related

JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0) - while tuning gpt2.finetune

I am working on fine-tuning a GPT-2 model to generate a title based on the content. While working on it, I created a simple CSV file containing only the titles to train the model, but when I feed this file to GPT-2 for fine-tuning I get the following error:
JSONDecodeError Traceback (most recent call last)
in ()
10 steps=1000,
11 save_every=200,
---> 12 sample_every=25) # steps is max number of training steps
13
14 # gpt2.generate(sess)
3 frames
/usr/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 if s.startswith('\ufeff'):
337 s = s.encode('utf8')[3:].decode('utf8')
--> 338 # raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",
339 # s, 0)
340 else:
JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)
Below is my code for the above:
import gpt_2_simple as gpt2

model_name = "120M"  # "355M" for larger model (it's 1.4 GB)
gpt2.download_gpt2(model_name=model_name)  # model is saved into current directory under /models/117M/

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              'titles.csv',
              model_name=model_name,
              steps=1000,
              save_every=200,
              sample_every=25)  # steps is max number of training steps
I have tried the basic mechanisms for handling a UTF-8 BOM but had no luck, hence requesting your help.
Try changing the model name: I see you passed "120M", but the GPT-2 model is called "124M".

How to find the number of unique strings in a column, followed by the position of a given string

I need to get 2 things from a TSV input file:
1- Find how many unique strings there are in a given column where the individual values are comma-separated. For this I used the command below, which gave me the unique values.
$awk < input.tsv '{print $5}' | sort | uniq | wc -l
Input file example with header (6 columns) and 10 rows:
$cat hum1003.tsv
p-Value Score Disease-Id Disease-Name Gene-Symbols Entrez-IDs
0.0463 4.6263 OMIM:117000 #117000 CENTRAL CORE DISEASE OF MUSCLE;;CCD;;CCOMINICORE MYOPATHY, MODERATE, WITH HAND INVOLVEMENT, INCLUDED;;MULTICORE MYOPATHY, MODERATE, WITH HAND INVOLVEMENT, INCLUDED;;MULTIMINICORE DISEASE, MODERATE, WITH HAND INVOLVEMENT, INCLUDED;;NEUROMUSCULAR DISEASE, CONGENITAL, WITH UNIFORM TYPE 1 FIBER, INCLUDED;CNMDU1, INCLUDED RYR1 (6261) 6261
0.0463 4.6263 OMIM:611705 MYOPATHY, EARLY-ONSET, WITH FATAL CARDIOMYOPATHY TTN (7273) 7273
0.0513 4.6263 OMIM:609283 PROGRESSIVE EXTERNAL OPHTHALMOPLEGIA WITH MITOCHONDRIAL DNA DELETIONS,AUTOSOMAL DOMINANT, 2 POLG2 (11232), SLC25A4 (291), POLG (5428), RRM2B (50484), C10ORF2 (56652) 11232, 291, 5428, 50484, 56652
0.0539 4.6263 OMIM:605637 #605637 MYOPATHY, PROXIMAL, AND OPHTHALMOPLEGIA; MYPOP;;MYOPATHY WITH CONGENITAL JOINT CONTRACTURES, OPHTHALMOPLEGIA, ANDRIMMED VACUOLES;;INCLUSION BODY MYOPATHY 3, AUTOSOMAL DOMINANT, FORMERLY; IBM3, FORMERLY MYH2 (4620) 4620
0.0577 4.6263 OMIM:609284 NEMALINE MYOPATHY 1 TPM2 (7169), TPM3 (7170) 7169, 7170
0.0707 4.6263 OMIM:608358 #608358 MYOPATHY, MYOSIN STORAGE;;MYOPATHY, HYALINE BODY, AUTOSOMAL DOMINANT MYH7 (4625) 4625
0.0801 4.6263 OMIM:255320 #255320 MINICORE MYOPATHY WITH EXTERNAL OPHTHALMOPLEGIA;;MINICORE MYOPATHY;;MULTICORE MYOPATHY;;MULTIMINICORE MYOPATHY MULTICORE MYOPATHY WITH EXTERNAL OPHTHALMOPLEGIA;;MULTIMINICORE DISEASE WITH EXTERNAL OPHTHALMOPLEGIA RYR1 (6261) 6261
0.0824 4.6263 OMIM:256030 #256030 NEMALINE MYOPATHY 2; NEM2 NEB (4703) 4703
0.0864 4.6263 OMIM:161800 #161800 NEMALINE MYOPATHY 3; NEM3MYOPATHY, ACTIN, CONGENITAL, WITH EXCESS OF THIN MYOFILAMENTS, INCLUDED;;NEMALINE MYOPATHY 3, WITH INTRANUCLEAR RODS, INCLUDED;;MYOPATHY, ACTIN, CONGENITAL, WITH CORES, INCLUDED ACTA1 (58) 58
0.0939 4.6263 OMIM:602771 RIGID SPINE MUSCULAR DYSTROPHY 1 MYH7 (4625), SEPN1 (57190), TTN (7273), ACTA1 (58) 4625, 57190, 7273, 58
So in this case the strings are gene names, and I want to count the unique strings across the entire 5th column, where they are separated by a comma and a space.
2- Next, the data is already ranked by the score in column 2. I want to know where the gene of interest sits in this ranked list within column 5 (Gene-Symbols). This has to be done after removing duplicates, since the same genes are repeated because of other parameters in the remaining columns, but that doesn't affect my final output; I only care about the ranked list as per column 2. How do I do that? Is there a command I can pipe onto the above command to get the position of a given value?
Expected output:
The command in point 1 should give me the number of unique genes in column 5. I have 18 genes in total in column 5, but only 14 unique values. If the gene of interest is TTN, its first occurrence is at the second position in the original ranked list, so the expected answer for where my gene of interest is located should be 2.
$14
$2
Thanks
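A rough sketch of one way to do this with awk, assuming the file really is tab-separated with a header row and that column 5 entries look like "TTN (7273), ACTA1 (58)" (the file name hum1003.tsv and the gene of interest TTN are just the ones from the question):

# 1) number of unique gene symbols across column 5
awk -F'\t' 'NR > 1 { n = split($5, g, ", "); for (i = 1; i <= n; i++) { sub(/ \(.*$/, "", g[i]); print g[i] } }' hum1003.tsv | sort -u | wc -l

# 2) rank of the first occurrence of the gene of interest (here TTN) among the unique genes
awk -F'\t' -v gene="TTN" 'NR > 1 {
    n = split($5, g, ", ")
    for (i = 1; i <= n; i++) {
        sub(/ \(.*$/, "", g[i])           # drop the "(Entrez-ID)" suffix
        if (!(g[i] in seen)) seen[g[i]] = ++rank
    }
} END { print seen[gene] }' hum1003.tsv

On the sample data above this prints 14 and 2, matching the expected output.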

Uncaught Error: Row 29 has 1 columns, but must have 32

I'm trying to get some CSV data into a DataTable in order to create a line graph. The CSV goes through a function that cleans up the data before the new DataTable is created; the result is the variable data2Array.
I only want to access 5 of the columns from data2Array. When I try to add a new row with the 5 data values, I get the following exception:
Uncaught Error: Row 29 has 1 columns, but must have 32
This has me confused, because I can see I've only tried to add the one row to data2Table, supplying only the 5 aforementioned values. Here is the log of what the row looks like:
0, Jul 1 2015, 3.379, 6.57, 0
data2Array has 32 columns and 29 rows, but in my DataTable I've only specified 5 columns before attempting to add a row. Since I am not adding anything to the original data source, but to data2Table, why does the exception mention row 29 or require 32 columns? My new DataTable has been defined with only 5 columns.
Below is the code I am working with currently:
var data2Array = fullDataArray(csvData2);
console.log("data2 clean: " + data2Array);
// request line graph - data2
var data2Table = new google.visualization.DataTable(data2Array);
data2Table.addColumn('number', 'colIndex');
data2Table.addColumn('string', 'colLabel1');
data2Table.addColumn('string', 'colLabel2');
data2Table.addColumn('string', 'colLabel3');
data2Table.addColumn('string', 'colLabel4');
var testArr = [0, data2Array[0][0],data2Array[0][4], data2Array[0][5], data2Array[0][7]];
console.log('test: ' + testArr);
data2Table.addRow([0, data2Array[0][0],data2Array[0][4], data2Array[0][5], data2Array[0][7]]);

Spark SQL performance too slow when the number of columns is 100000

I'm testing Spark performance with a table that has a very large number of columns.
What I did is very simple:
Prepare a CSV file which has very many columns and only 2 data records.
For example, the CSV file looks like this:
col000001,col000002,,,,,,,col100000
dtA000001,dtA000002,,,,,,,,dtA100000
dtB000001,dtB000002,,,,,,,,dtB100000
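As an aside, one way to generate a test file of this shape (a sketch only; the actual generation script isn't shown in the question, and the column count n is a parameter):

awk -v n=100000 'BEGIN {
    for (i = 1; i <= n; i++) printf "col%06d%s", i, (i < n ? "," : "\n")   # header row
    for (i = 1; i <= n; i++) printf "dtA%06d%s", i, (i < n ? "," : "\n")   # first data record
    for (i = 1; i <= n; i++) printf "dtB%06d%s", i, (i < n ? "," : "\n")   # second data record
}' > 100000c.csv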
dfdata100000 = sqlContext.read.csv('../datasets/100000c.csv', header='true')
dfdata100000.registerTempTable("tbl100000")
result = sqlContext.sql("select col000001,col100000 from tbl100000")
Then I get 1 row with show(1):
%%time
result.show(1)
File sizes are as follows (very small).
The file name shows the number of columns:
$ du -m *c.csv
3 100000c.csv
1 10000c.csv
1 1000c.csv
1 100c.csv
1 20479c.csv
2 40000c.csv
2 60000c.csv
3 80000c.csv
Looking at the timing results for each of the files above, the execution time increases exponentially.
Example result:
+---------+---------+
|col000001|col100000|
+---------+---------+
|dtA000001|dtA100000|
+---------+---------+
only showing top 1 row
CPU times: user 218 ms, sys: 509 ms, total: 727 ms
Wall time: 53min 22s
Question 1: Is this an acceptable result? Why does the execution time increase exponentially?
Question 2: Is there any other method to make this faster?

How to access each row of a dataframe in SparkR

I'm running R on Spark using SparkR. I have created a data frame from a CSV file. Now I need to access each row, as well as the data in that row. Is there any method to do that?
In SparkR it is not possible to access the data in a row directly. The way to do it is to convert the SparkR data frame to an R data frame using collect:
> R_people <- collect(people)
> head(R_people)
##   age    name
## 1  NA Michael
## 2  30    Andy
## 3  19  Justin
> R_people$age[3]
# 19
# By using this you can filter rows in the SparkR data frame `people`:
> showDF(filter(people, people$age == R_people$age[3]))
## age   name
# 1  19 Justin
