hive not taking values - hadoop

I am trying to import a file into Hive.
The sample data is as follows:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
My table declaration is as follows:
create table movies(id int,title string,genre string) row format delimited fields terminated by '::';
But after loading the data, my table shows data for the first two fields only.
Total MapReduce CPU Time Spent: 1 seconds 600 msec
OK
1 Toy Story (1995)
2 Jumanji (1995)
3 Grumpier Old Men (1995)
4 Waiting to Exhale (1995)
Time taken: 22.087 seconds
Can anyone help me understand why this is happening, or how to debug it?

Hive's default row format supports only a single-character field delimiter. Since your delimiter '::' is two characters, try creating the table with MultiDelimitSerDe instead.
Query:
CREATE TABLE movies (id int, title string, genre string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="::")
STORED AS TEXTFILE;
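For completeness, the sample file can then be loaded in the usual way (a sketch; the local path is hypothetical, and on some distributions you may first need to ADD JAR the hive-contrib jar that provides MultiDelimitSerDe):
-- hypothetical local path to the sample file shown in the question
LOAD DATA LOCAL INPATH '/tmp/movies.dat' INTO TABLE movies;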
Output:
hive> select * from movies;
OK
1 Toy Story (1995) Animation|Children's|Comedy
2 Jumanji (1995) Adventure|Children's|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama
Please refer to this similar post:
Load data into Hive with custom delimiter

Related

how to speed up sort in hive

I would like to speed up a Hive process, but I do not know how.
The data is about 200 GB of text, roughly 300,000,000 lines, which I split into 50 files in advance, so each file is about 4 GB.
I want a single file as the result of the sort, so I set the number of reducers to 1 and the number of mappers to 50.
Each line of the data consists of a word and a frequency.
The same words should be grouped and their frequencies summed.
All of the files are gzip files.
It takes a few days to complete the process, and I would like to speed it up to a few hours if I can.
Which parameters should I change to speed up the process?
Thank you for your reply.
Yes, I defined an external Hive table pointing to an HDFS location.
Here is my pseudo code:
create external table A (`count` int, word string)
row format delimited fields terminated by '\t'
location 'HDFS path';
select word, sum(`count`) as total from A group by word sort by total desc;
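For reference, a sketch of how the reducer count described above is typically pinned when Hive runs on MapReduce (a standard Hadoop setting name; the value mirrors the question, not a recommendation):
-- force a single reducer so the sorted result lands in one file
SET mapreduce.job.reduces=1;
-- gzip input is not splittable, so the 50 input files already yield 50 map tasks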

How to optimize scan of 1 huge file / table in Hive to confirm/check if lat long point is contained in a wkt geometry shape

I am currently trying to associate each lat/long ping from a device with its ZIP code.
I have de-normalized the lat/long device ping data and created a cross-product/Cartesian join table in which each row has the ST_Point(long, lat), the geometry shape of the ZIP, and the ZIP code associated with that geometry. For testing purposes I have around 45 million rows in the table, and in production it will grow by about 1 billion rows every day.
Even though the data is flattened and there are no join conditions, the query takes about 2 hours to complete. Is there a faster way to compute spatial queries? Or how can I optimize the following query?
Below are some of the optimization steps I have already performed. With these optimizations, every other operation finishes in at most 5 minutes except for this one step. I am using an AWS cluster with 2 master nodes and 5 data nodes.
set hive.vectorized.execution.enabled = true;
set hive.execution.engine=tez;
set hive.enforce.sorting=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
analyze table tele_us_zipmatch compute statistics for columns;
CREATE TABLE zipcheck (
`long4` double,
`lat4` double,
state_name string,
country_code string,
country_name string, region string,
zip int,
countyname string) PARTITIONED by (state_id string)
STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY",
'orc.create.index'='true',
'orc.bloom.filter.columns'='');
INSERT OVERWRITE TABLE zipcheck PARTITION(state_id)
select long4, lat4, state_name, country_code, country_name, region, zip, countyname, state_id from tele_us_zipmatch
where ST_Contains(wkt_shape,zip_point)=TRUE;
ST_Contains is a function from Esri's spatial framework for Hadoop (ref: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation#relationship-tests).
Any help is greatly appreciated.
Thanks.
If the ZIP-code dataset can fit into memory, try a custom MapReduce application that builds a just-in-time in-memory quadtree index over the ZIP-code data, adapting the sample in GIS-Tools-for-Hadoop.

hive prints null values while selecting the particular column

I have a Hive table backed by JSON data. When I select the whole table it works, but if I select a particular column it prints NULL values.
The data looks like this
{"page_1":"{\"city\":\"Bangalore\",\"locality\":\"Battarahalli\",\"Name_of_Person\":\"xxx\",\"User_email_address\":\"test#gmail.com\",\"user_phone_number\":\"\",\"sub_locality\":\"\",\"street_name\":\"7th Cross Road, Near Reliance Fresh, T.c Palya,\",\"home_plot_no\":\"45\",\"pin_code\":\"560049\",\"project_society_build_name\":\"Sunshine Layout\",\"landmark_reference_1\":\"\",\"landmark_reference_2\":\"\",\"No_of_Schools\":20,\"No_of_Hospitals\":20,\"No_of_Metro\":0,\"No_of_Mall\":11,\"No_of_Park\":10,\"Distance_of_schools\":1.55,\"Distance_of_Hospitals\":2.29,\"Distance_of_Metro\":0,\"Distance_of_Mall\":1.55,\"Distance_of_Park\":2.01,\"lat\":13.0243273,\"lng\":77.7077906,\"ipinfo\":{\"ip\":\"113.193.30.130\",\"hostname\":\"No Hostname\",\"city\":\"\",\"region\":\"\",\"country\":\"IN\",\"loc\":\"20.0000,77.0000\",\"org\":\"AS45528 Tikona Digital Networks Pvt Ltd.\"}}","page_2":"{\"home_type\":\"Flat\",\"area\":\"1350\",\"beds\":\"3 BHK\",\"bath_rooms\":2,\"building_age\":\"1\",\"floors\":2,\"balcony\":2,\"amenities\":\"premium\",\"amenities_options\":{\"gated_security\":\"\",\"physical_security\":\"\",\"cctv_camera\":\"\",\"controll_access\":\"\",\"elevator\":true,\"power_back_up\":\"\",\"parking\":true,\"partial_parking\":\"\",\"onsite_maintenance_store\":\"\",\"open_garden\":\"\",\"party_lawn\":\"\",\"amenities_balcony\":\"\",\"club_house\":\"\",\"fitness_center\":\"\",\"swimming_pool\":\"\",\"party_hall\":\"\",\"tennis_court\":\"\",\"basket_ball_court\":\"\",\"squash_coutry\":\"\",\"amphi_theatre\":\"\",\"business_center\":\"\",\"jogging_track\":\"\",\"convinience_store\":\"\",\"guest_rooms\":\"\"},\"interior\":\"regular\",\"interior_options\":{\"tiles\":true,\"marble\":\"\",\"wooden\":\"\",\"modular_kitchen\":\"\",\"partial_modular_kitchen\":\"\",\"gas_pipe\":\"\",\"intercom_system\":\"\",\"air_conditioning\":\"\",\"partial_air_conditioning\":\"\",\"wardrobe\":\"\",\"sanitation_fixtures\":\"\",\"false_ceiling\":\"\",\"partial_false_ceiling\":\"\",\"recessed_lighting\":\"\"},\"location\":\"regular\",\"location_options\":{\"good_view\":true,\"transporation_hub\":true,\"shopping_center\":\"\",\"hospital\":\"\",\"school\":\"\",\"ample_parking\":\"\",\"park\":\"\",\"temple\":\"\",\"bank\":\"\",\"less_congestion\":\"\",\"less_pollution\":\"\"},\"maintenance\":\"\",\"maintenance_value\":\"\",\"near_by\":{\"school\":\"\",\"hospital\":\"\",\"mall\":\"\",\"park\":\"\",\"metro\":\"\",\"Near_by_school\":\"Little Champ Gurukulam Pre School \\\/ 1.52 km\",\"Near_by_hospital\":\"Suresh Hospital \\\/ 2.16 km\",\"Near_by_mall\":\"LORVEN LEO \\\/ 2.13 km\",\"Near_by_park\":\"SURYA ENCLAIVE \\\/ 2.09 km\"},\"city\":\"Bangalore\",\"locality\":\"Battarahalli\",\"token\":\"344bd4f0fab99b460873cfff6befb12f\"}"}
I created the table like this
CREATE EXTERNAL TABLE orc_test (json string)
LOCATION '/user/ec2-user/test_orc';
If I use this query, it works:
select * from orc_test;
If I try to select one column, it prints NULL:
select get_json_object(orc_test.json,'$.locality') as loc
from orc_test;
It prints
NULL NULL NULL
Any help will be appreciated.
In your case, I think the backslashes in your data are causing the problem, along with the quotes surrounding your page data. I have listed the updated data below; if you save it to a file and load it into the table, your query should work.
{"page_1":{"city":"Bangalore","locality":"Battarahalli","Name_of_Person":"xxx","User_email_address":"test#gmail.com","user_phone_number":"","sub_locality":"","street_name":"7th Cross Road, Near Reliance Fresh, T.c Palya,","home_plot_no":"45","pin_code":"560049","project_society_build_name":"Sunshine Layout","landmark_reference_1":"","landmark_reference_2":"","No_of_Schools":20,"No_of_Hospitals":20,"No_of_Metro":0,"No_of_Mall":11,"No_of_Park":10,"Distance_of_schools":1.55,"Distance_of_Hospitals":2.29,"Distance_of_Metro":0,"Distance_of_Mall":1.55,"Distance_of_Park":2.01,"lat":13.0243273,"lng":77.7077906,"ipinfo":{"ip":"113.193.30.130","hostname":"No Hostname","city":"","region":"","country":"IN","loc":"20.0000,77.0000","org":"AS45528 Tikona Digital Networks Pvt Ltd."}},"page_2":{"home_type":"Flat","area":"1350","beds":"3 BHK","bath_rooms":2,"building_age":"1","floors":2,"balcony":2,"amenities":"premium","amenities_options":{"gated_security":"","physical_security":"","cctv_camera":"","controll_access":"","elevator":true,"power_back_up":"","parking":true,"partial_parking":"","onsite_maintenance_store":"","open_garden":"","party_lawn":"","amenities_balcony":"","club_house":"","fitness_center":"","swimming_pool":"","party_hall":"","tennis_court":"","basket_ball_court":"","squash_coutry":"","amphi_theatre":"","business_center":"","jogging_track":"","convinience_store":"","guest_rooms":""},"interior":"regular","interior_options":{"tiles":true,"marble":"","wooden":"","modular_kitchen":"","partial_modular_kitchen":"","gas_pipe":"","intercom_system":"","air_conditioning":"","partial_air_conditioning":"","wardrobe":"","sanitation_fixtures":"","false_ceiling":"","partial_false_ceiling":"","recessed_lighting":""},"location":"regular","location_options":{"good_view":true,"transporation_hub":true,"shopping_center":"","hospital":"","school":"","ample_parking":"","park":"","temple":"","bank":"","less_congestion":"","less_pollution":""},"maintenance":"","maintenance_value":"","near_by":{"school":"","hospital":"","mall":"","park":"","metro":"","Near_by_school":"Little Champ Gurukulam Pre School / 1.52 km","Near_by_hospital":"Suresh Hospital / 2.16 km","Near_by_mall":"LORVEN LEO / 2.13 km","Near_by_park":"SURYA ENCLAIVE / 2.09 km"},"city":"Bangalore","locality":"Battarahalli","token":"344bd4f0fab99b460873cfff6befb12f"}}
I tried this and it works for me.
hive> select get_json_object(orc_test.json,'$.page_1.locality') as loc from orc_test;
OK
Battarahalli
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive> select get_json_object(orc_test.json,'$.page_1.city') as loc from orc_test;
OK
Bangalore
Time taken: 0.097 seconds, Fetched: 1 row(s)
hive> select get_json_object(orc_test.json,'$.page_2.home_type') as loc from orc_test;
OK
Flat
Time taken: 0.091 seconds, Fetched: 1 row(s)
It seems you created the table with only a single column, so Hive treats the entire JSON value as one string; that is why the individual columns come back NULL.
Use a JSON SerDe so that Hive can map your JSON fields to the columns in your table.
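For illustration, a minimal sketch using HCatalog's JsonSerDe (the table name pages_json is hypothetical, and the exact SerDe class varies by distribution, e.g. org.openx.data.jsonserde.JsonSerDe on some setups):
-- maps the two top-level keys of the sample data to columns
CREATE EXTERNAL TABLE pages_json (page_1 string, page_2 string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/ec2-user/test_orc';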
Besides vmachan's answer, which I think is right, the problem I encountered in a similar situation was that the JSON records weren't placed on separate lines. It also didn't work when the input was an array. For example, this worked fine with Hive 3.1.0 using LATERAL VIEW/json_tuple:
{"color":"black","category":"hue","type":"primary","code":{"rgba":[255,255,255,1],"hex":"#000"}}
{"color":"white","category":"value","code":{"rgba":[0,0,0,1],"hex":"#FFF"}}
{"color":"red","category":"hue","type":"primary","code":{"rgba":[255,0,0,1],"hex":"#FF0"}}
{"color":"blue","category":"hue","type":"primary","code":{"rgba":[0,0,255,1],"hex":"#00F"}}
{"color":"yellow","category":"hue","type":"primary","code":{"rgba":[255,255,0,1],"hex":"#FF0"}}
{"color":"green","category":"hue","type":"secondary","code":{"rgba":[0,255,0,1],"hex":"#0F0"}}
and this did NOT work:
[{"color":"black","category":"hue","type":"primary","code":{"rgba":[255,255,255,1],"hex":"#000"}},
{"color":"white","category":"value","code":{"rgba":[0,0,0,1],"hex":"#FFF"}},
{"color":"red","category":"hue","type":"primary","code":{"rgba":[255,0,0,1],"hex":"#FF0"}},
{"color":"blue","category":"hue","type":"primary","code":{"rgba":[0,0,255,1],"hex":"#00F"}},
{"color":"yellow","category":"hue","type":"primary","code":{"rgba":[255,255,0,1],"hex":"#FF0"}},
{"color":"green","category":"hue","type":"secondary","code":{"rgba":[0,255,0,1],"hex":"#0F0"}}]

How to speed up the creation of a table from a table partitioned by date?

I have a table with a huge amount of data. It is partitioned by week. The table contains a column named group, and each group can have records for multiple weeks. For example:
gr week data
1 1 10
1 2 13
1 3 5
. . 6
2 2 14
2 3 55
. . .
I want to create a table based on a single group. Creating one currently takes ~23 minutes on Oracle 11g. That is a long time, since I have to repeat the process for each group and I have many groups. What is the fastest way to create these tables?
Create all the tables first, then use INSERT ALL ... WHEN:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_9014.htm#i2145081
This way the data will be read only once.
insert all
  when gr=1 then into tab1 values (gr, week, data)
  when gr=2 then into tab2 values (gr, week, data)
  when gr=3 then into tab3 values (gr, week, data)
select * from big_table;
You will get the best speedup if you avoid copying the data per group at all and instead process it week by week; however, you don't say what you ultimately want to achieve, so it is hard to comment further (that approach may of course be difficult or impracticable, but you should at least consider it).
With that said, here are some hints for extracting the group data:
remove all indexes, as they will only consume space - all you need is one large FULL TABLE SCAN
check the available space and the size of each group; maybe you can process several groups in one pass
deploy parallel query:
create table tmp as
select /*+ parallel(4) */ * from BIG_TABLE
where group_id in (..list of groupIds..);
Please note that parallel mode must be enabled in the database; ask your DBA if you are unsure. The point is that the large FULL TABLE SCAN is performed by several sub-processes (here 4), which may, depending on your system, cut the elapsed time.

How to handle volatile records efficiently

I have this problem and I wanted to run it by others to see if I can handle it in a better way.
We have a 300-node cluster and we process transaction information/records on a daily basis. We get about 10 million transactions each day, with a record size of roughly 2 KB each.
We currently use HDFS for data storage and Pig and Hive for data processing. In most cases we use external Hive tables, partitioned by the transaction creation date.
The business is such that we might get an update to a transaction that was created months or years earlier. For example, I might get an update for a transaction created 5 years ago. We can't ignore such a record, so we have to reprocess the whole corresponding partition just for that single record.
On a daily basis we end up reprocessing about 1000 partitions because of this. There are further ETL applications that use these transaction tables.
I understand that this is a limitation of the Hive/HDFS architecture.
I am sure others have had this problem; it would be really helpful if you could share the options you have tried and how you overcame it.
You do not have to overwrite partitions: you can simply insert into them. Just do not include "overwrite" in your insert commands.
Below is an example of a table partitioned by date, into which I did an insert (without overwrite!) twice - and you can see the records are there twice. That shows the partition got appended to, not overwritten/dropped.
insert into table insert_test select ..  -- do not put "overwrite" here!
hive> select * from insert_test;
OK
name date
row1 2014-03-05
row2 2014-03-05
row1 2014-03-05
row2 2014-03-05
row3 2014-03-06
row4 2014-03-06
row3 2014-03-06
row4 2014-03-06
row5 2014-03-07
row5 2014-03-07
row6 2014-03-09
row7 2014-03-09
row6 2014-03-09
row7 2014-03-09
row8 2014-03-16
row8 2014-03-16
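For completeness, a fuller sketch of the append-style insert above (staging_transactions is a hypothetical source table; the dynamic-partition settings are the standard Hive ones):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- append (no OVERWRITE) into the date partitions; existing rows are kept
INSERT INTO TABLE insert_test PARTITION (`date`)
SELECT name, `date` FROM staging_transactions;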
