How to get all the versions of an HBase row - hadoop

I am trying to run the following command in the HBase shell:
scan 'testLastVersion' {VERSIONS=>8}
It returns only the latest version of the row.
How can I get all the versions of a row through the command shell and through Java code?
Thanks!

I think you are missing the ',' there. The command should be something like this:
scan 'emp', {VERSIONS=>8}
With the comma missing, HBase should actually throw an error like:
SyntaxError: (hbase):16: syntax error, unexpected tLCURLY
I tried to simulate your scenario and got all the versions. Please find the results below.
hbase(main):010:0> put 'emp', '1', 'personal_data:name', 'Ajay'
0 row(s) in 0.0220 seconds
hbase(main):012:0> put 'emp', '1', 'personal_data:name', 'Vijay'
0 row(s) in 0.0140 seconds
hbase(main):014:0> put 'emp', '1', 'personal_data:name', 'Ceema'
0 row(s) in 0.0070 seconds
hbase(main):017:0> scan 'emp', {VERSIONS=>3}
ROW COLUMN+CELL
1 column=personal_data:name, timestamp=1472651320449, value=Ceema
1 column=personal_data:name, timestamp=1472651313396, value=Vijay
1 column=personal_data:name, timestamp=1472651300718, value=Ajay
1 row(s) in 0.0220 seconds
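The question also asks about Java. Here is a minimal sketch using the HBase client API (1.x style; on HBase 2.x you would call scan.readVersions(8) or scan.readAllVersions() instead of setMaxVersions). Note that the column family itself must be created or altered with VERSIONS => 8 (or more), otherwise older versions are discarded during flush/compaction and no scan can return them. Table and column names follow the shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanAllVersions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("emp"))) {
            Scan scan = new Scan();
            scan.setMaxVersions(8); // ask for up to 8 versions per cell
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    // rawCells() returns one Cell per version of each column
                    for (Cell cell : result.rawCells()) {
                        System.out.println(Bytes.toString(CellUtil.cloneRow(cell)) + " "
                                + Bytes.toString(CellUtil.cloneValue(cell)) + " @ "
                                + cell.getTimestamp());
                    }
                }
            }
        }
    }
}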

Related

tmssoftware TTMSFNCGrid slow data loading

Delphi 10.4.2, TTMSFNCGrid ver. 1.0.5.16
I am downloading about 30,000 records from the database into a JSON object. This takes about one minute.
I then loop over the data and write it into a TTMSFNCGrid with about 30,000 rows and 16 columns. Populating the grid takes 20 minutes! That is how long it takes to fill and render the grid. How can I speed up this process?
I use something like this:
for _i:= 0 to JSON_ARRAY_DANE.Count-1 do
begin
_row:= JSON_ARRAY_DANE.Items[_i] as TJSONObject;
_grid.Cells[0,_i+1]:=_row.GetValue('c1').Value;
_grid.Cells[1,_i+1]:=_row.GetValue('c2').Value;
_grid.Cells[2,_i+1]:=_row.GetValue('c3').Value;
.
.
_grid.Cells[16,_i+1]:=_row.GetValue('c16').Value;
end;
Resolved.
I needed to wrap the loop in _grid.BeginUpdate and _grid.EndUpdate, which suspends grid redraws while the data is written and repaints only once at the end:
_grid.BeginUpdate;
for _i:= 0 to JSON_ARRAY_DANE.Count-1 do
begin
_row:= JSON_ARRAY_DANE.Items[_i] as TJSONObject;
_grid.Cells[0,_i+1]:=_row.GetValue('c1').Value;
_grid.Cells[1,_i+1]:=_row.GetValue('c2').Value;
_grid.Cells[2,_i+1]:=_row.GetValue('c3').Value;
.
.
_grid.Cells[16,_i+1]:=_row.GetValue('c16').Value;
end;
_grid.EndUpdate;

Impala substr doesn't extract UTF-8 characters correctly

I am new to ETL and was assigned a task to sanitize some sensitive information before handing the data to a client.
I am using the HUE web client with Impala.
What I want to do is:
For example, a column value like '京客隆(三里屯店)' needs to be transformed into something like '京XXX店)'.
My query is:
select '京客隆(三里屯店)', concat(substr('京客隆(三里屯店)', 1, 3), 'XXX', substr('京客隆(三里屯店)', char_length('京客隆(三里屯店)') -6, 6));
But I get gibberish in the output:
'京客隆(三里屯店)' | concat(substr('京客隆(三里屯店)', 1, 3), 'xxx', substr('京客隆(三里屯店)', char_length('京客隆(三里屯店)') - 6, 6))
京客隆(三里屯店) | 京XXX�店�
The problem is that this query:
select '京客隆(三里屯店)', substr('京客隆(三里屯店)', char_length('京客隆(三里屯店)') -3 , 3);
output: 京客隆(三里屯店) ��
doesn't return the correct characters. Why is that? I pasted the string into a Python shell and I can get the correct character if I take just the last 3 bytes.
It turns out that I misunderstood the function substr.
substr(STRING a, INT start [, INT len])
It works on bytes, starting from (and including) position start. My string '京客隆(三里屯店)' is 27 bytes long in total, and each UTF-8 character here takes 3 bytes. To take the last 3 bytes, which form the closing ')', I need to write:
substr('京客隆(三里屯店)', 27 - 2, 3)
That returns bytes 25, 26 and 27 and displays the character ')' correctly.
Update:
I was told to use:
SELECT regexp_replace('京客隆(三里屯店)', '(.)(.*)(.{2})', '\\1***\\3');
which works like a charm :P. The pattern captures the first character, everything in the middle, and the last two characters, then keeps groups 1 and 3 and replaces the middle with '***'.
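Applied to a real column rather than a string literal, the same masking would look roughly like this (the table and column names are made up for illustration):

-- hypothetical table and column names
SELECT regexp_replace(shop_name, '(.)(.*)(.{2})', '\\1***\\3') AS masked_name
FROM customer_shops;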

time data doesn't match format specified

I am trying to convert a string column to datetime in Python. My data appears to match the format, but I still get:
ValueError: time data 11 11 doesn't match format specified
I am not sure where the "11 11" in the error comes from.
My code is:
train_df['date_captured1'] = pd.to_datetime(train_df['date_captured'], format="%Y-%m-%d %H:%M:%S")
Head of data is
print (train_df.date_captured.head())
0 2011-05-13 23:43:18
1 2012-03-17 03:48:44
2 2014-05-11 11:56:46
3 2013-10-06 02:00:00
4 2011-07-12 13:11:16
Name: date_captured, dtype: object
I tried the following, selecting just the first string and running the code with the same datetime format. Both variants work without a problem:
dt=train_df['date_captured']
dt1=dt[0]
date = datetime.datetime.strptime(dt1, "%Y-%m-%d %H:%M:%S")
print(date)
2011-05-13 23:43:18
and
dt1=pd.to_datetime(dt1, format='%Y-%m-%d %H:%M:%S')
print (dt1)
2011-05-13 23:43:18
But why, when I use the same format in pd.to_datetime to convert the whole column, does it raise the error above?
Thank you.
I solved it.
train_df['date_time'] = pd.to_datetime(train_df['date_captured'], errors='coerce')
print (train_df[train_df.date_time.isnull()])
I found that in row 100372 the date_captured value is '11 11':
category_id date_captured ... height date_time
100372 10 11 11 ... 747 NaT
So with errors='coerce', the invalid value is parsed as NaT instead of raising an error.
Thank you.
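As a follow-up, a small sketch (using the column names from the question) of how one might combine the explicit format with errors='coerce', inspect the rows that fail to parse, and drop them if they are not needed:

import pandas as pd

# parse with the expected format; unparseable values such as '11 11' become NaT
train_df['date_time'] = pd.to_datetime(train_df['date_captured'],
                                       format='%Y-%m-%d %H:%M:%S',
                                       errors='coerce')

# inspect the offending rows
print(train_df.loc[train_df['date_time'].isnull(), 'date_captured'])

# drop them if they are not needed
train_df = train_df.dropna(subset=['date_time'])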

Can someone explain the output of orcfiledump?

My table test_orc contains (for one partition):
col1 col2 part1
abc def 1
ghi jkl 1
mno pqr 1
koi hai 1
jo pgl 1
hai tre 1
By running
hive --orcfiledump /hive/user.db/test_orc/part1=1/000000_0
I get the following:
Structure for /hive/a0m01lf.db/test_orc/part1=1/000000_0 .
2018-02-18 22:10:24 INFO: org.apache.hadoop.hive.ql.io.orc.ReaderImpl - Reading ORC rows from /hive/a0m01lf.db/test_orc/part1=1/000000_0 with {include: null, offset: 0, length: 9223372036854775807} .
Rows: 6 .
Compression: ZLIB .
Compression size: 262144 .
Type: struct<_col0:string,_col1:string> .
Stripe Statistics:
Stripe 1:
Column 0: count: 6 .
Column 1: count: 6 min: abc max: mno sum: 17 .
Column 2: count: 6 min: def max: tre sum: 18 .
File Statistics:
Column 0: count: 6 .
Column 1: count: 6 min: abc max: mno sum: 17 .
Column 2: count: 6 min: def max: tre sum: 18 .
Stripes:
Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67 .
Stream: column 0 section ROW_INDEX start: 3 length 9 .
Stream: column 1 section ROW_INDEX start: 12 length 29 .
Stream: column 2 section ROW_INDEX start: 41 length 29 .
Stream: column 1 section DATA start: 70 length 20 .
Stream: column 1 section LENGTH start: 90 length 12 .
Stream: column 2 section DATA start: 102 length 21 .
Stream: column 2 section LENGTH start: 123 length 5 .
Encoding column 0: DIRECT .
Encoding column 1: DIRECT_V2 .
Encoding column 2: DIRECT_V2 .
What does the part about stripes mean?
First, let's look at how an ORC file is laid out. Here are the key terms from that layout that also appear in your question:
Stripe - a chunk of data stored in an ORC file. Every ORC file is divided into these chunks, called stripes, each around 250 MB by default, holding index data, the actual row data, and a stripe footer with metadata about that stripe.
Compression - The compression codec used to compress the data stored. ZLIB is the default for ORC.
Index Data - includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.
Row data - the actual data; it is used in table scans.
Stripe Footer - The stripe footer contains the encoding of each column and the directory of the streams including their location. To describe each stream, ORC stores the kind of stream, the column id, and the stream’s size in bytes. The details of what is stored in each stream depends on the type and encoding of the column.
Postscript - holds compression parameters and the size of the compressed footer.
File Footer - The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates count, min, max, and sum.
Now, about your output from orcfiledump:
1. First comes general information about your file: the name, location, compression codec, compression size, etc.
2. Stripe Statistics lists every stripe in your ORC file and its corresponding information: the value counts and, where available, statistics such as min, max and sum for each column.
3. File Statistics is similar to #2, just for the complete file as opposed to each individual stripe.
4. The last part, the Stripes section, describes each stripe in detail: its offset and sizes, the streams (ROW_INDEX, DATA, LENGTH) stored for each column with their offsets and lengths, and the encoding used for each column.
Also, you can use various options with orcfiledump to tailor the output. Here is a handy guide:
// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump]
[--backup-path <new-path>] <location-of-orc-file-or-directory>
And a quick guide to the options used in the commands above:
Specifying -d in the command will cause it to dump the ORC file data
rather than the metadata (Hive 1.1.0 and later).
Specifying --rowindex with a comma separated list of column ids will
cause it to print row indexes for the specified columns, where 0 is
the top level struct containing all of the columns and 1 is the first
column id (Hive 1.1.0 and later).
Specifying -t in the command will print the timezone id of the
writer.
Specifying -j in the command will print the ORC file metadata in JSON
format. To pretty print the JSON metadata, add -p to the command.
Specifying --recover in the command will recover a corrupted ORC file
generated by Hive streaming.
Specifying --skip-dump along with --recover will perform recovery
without dumping metadata.
Specifying --backup-path with a new-path will let the recovery tool
move corrupted files to the specified backup path (default: /tmp).
<location-of-orc-file> is the URI of the ORC file.
<location-of-orc-file-or-directory> is the URI of the ORC file or
directory. From Hive 1.3.0 onward, this URI can be a directory
containing ORC files.
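For example, to dump the actual row data plus the row indexes of columns 1 and 2 for the file in your question (options as documented above, Hive 1.2.0 or later), you would run something like:
hive --orcfiledump -d --rowindex 1,2 /hive/user.db/test_orc/part1=1/000000_0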
Hope that helps!

How to select rows where hours and minutes = 0 using da.SearchCursor

I am trying to exclude some results from numerous tables in an arc.sde database, and the only field I can use is a date field. I have researched the Python 2 documentation and tried to understand section 8.1 on datetime etc., but have not been able to achieve my goal yet. (Using Windows 7, ArcGIS 10.2, Python 2.7.5 and a mixed OS environment.)
The code below runs fine:
with arcpy.da.SearchCursor(fc, ['LASTEDIT_ON']) as cursor:
    for row in cursor:
        if row[0] <> None:
            print str(row[0])
But I need it to exclude the rows returned where the hours/minutes/seconds are all 00:00:00.
2014-05-13 16:16:34
2014-09-26 11:45:15
2015-06-18 14:47:05
2015-02-03 10:38:50
2008-03-10 00:00:00
2007-06-06 00:00:00
I tried adding hour and minute checks to my code, but I think I'm totally on the wrong track. The error is below.
if row[0] <> None and datetime.hour <> 0:
Error Info:
'module' object has no attribute 'hour'
If your date field is a true date field, not a text field, the following code will print the dates whose hour, minute and second are all zero:
import arcpy

with arcpy.da.SearchCursor(fc, 'LASTEDIT_ON') as cursor:
    for row in cursor:
        if row[0].hour == 0:
            if row[0].minute == 0:
                if row[0].second == 0:
                    print row[0]
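To get the exclusion the question actually asks for (skip rows whose time part is exactly 00:00:00), a sketch along the same lines, assuming LASTEDIT_ON is a true date field:

import datetime
import arcpy

with arcpy.da.SearchCursor(fc, ['LASTEDIT_ON']) as cursor:
    for row in cursor:
        # skip nulls and rows whose time component is midnight
        if row[0] is not None and row[0].time() != datetime.time(0, 0, 0):
            print str(row[0])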
