Hive prints NULL values when selecting a particular column - hadoop

I have a Hive table that holds JSON data. When I select the whole table, it works. If I select a particular column, it prints NULL values.
The data looks like this:
{"page_1":"{\"city\":\"Bangalore\",\"locality\":\"Battarahalli\",\"Name_of_Person\":\"xxx\",\"User_email_address\":\"test#gmail.com\",\"user_phone_number\":\"\",\"sub_locality\":\"\",\"street_name\":\"7th Cross Road, Near Reliance Fresh, T.c Palya,\",\"home_plot_no\":\"45\",\"pin_code\":\"560049\",\"project_society_build_name\":\"Sunshine Layout\",\"landmark_reference_1\":\"\",\"landmark_reference_2\":\"\",\"No_of_Schools\":20,\"No_of_Hospitals\":20,\"No_of_Metro\":0,\"No_of_Mall\":11,\"No_of_Park\":10,\"Distance_of_schools\":1.55,\"Distance_of_Hospitals\":2.29,\"Distance_of_Metro\":0,\"Distance_of_Mall\":1.55,\"Distance_of_Park\":2.01,\"lat\":13.0243273,\"lng\":77.7077906,\"ipinfo\":{\"ip\":\"113.193.30.130\",\"hostname\":\"No Hostname\",\"city\":\"\",\"region\":\"\",\"country\":\"IN\",\"loc\":\"20.0000,77.0000\",\"org\":\"AS45528 Tikona Digital Networks Pvt Ltd.\"}}","page_2":"{\"home_type\":\"Flat\",\"area\":\"1350\",\"beds\":\"3 BHK\",\"bath_rooms\":2,\"building_age\":\"1\",\"floors\":2,\"balcony\":2,\"amenities\":\"premium\",\"amenities_options\":{\"gated_security\":\"\",\"physical_security\":\"\",\"cctv_camera\":\"\",\"controll_access\":\"\",\"elevator\":true,\"power_back_up\":\"\",\"parking\":true,\"partial_parking\":\"\",\"onsite_maintenance_store\":\"\",\"open_garden\":\"\",\"party_lawn\":\"\",\"amenities_balcony\":\"\",\"club_house\":\"\",\"fitness_center\":\"\",\"swimming_pool\":\"\",\"party_hall\":\"\",\"tennis_court\":\"\",\"basket_ball_court\":\"\",\"squash_coutry\":\"\",\"amphi_theatre\":\"\",\"business_center\":\"\",\"jogging_track\":\"\",\"convinience_store\":\"\",\"guest_rooms\":\"\"},\"interior\":\"regular\",\"interior_options\":{\"tiles\":true,\"marble\":\"\",\"wooden\":\"\",\"modular_kitchen\":\"\",\"partial_modular_kitchen\":\"\",\"gas_pipe\":\"\",\"intercom_system\":\"\",\"air_conditioning\":\"\",\"partial_air_conditioning\":\"\",\"wardrobe\":\"\",\"sanitation_fixtures\":\"\",\"false_ceiling\":\"\",\"partial_false_ceiling\":\"\",\"recessed_lighting\":\"\"},\"location\":\"regular\",\"location_options\":{\"good_view\":true,\"transporation_hub\":true,\"shopping_center\":\"\",\"hospital\":\"\",\"school\":\"\",\"ample_parking\":\"\",\"park\":\"\",\"temple\":\"\",\"bank\":\"\",\"less_congestion\":\"\",\"less_pollution\":\"\"},\"maintenance\":\"\",\"maintenance_value\":\"\",\"near_by\":{\"school\":\"\",\"hospital\":\"\",\"mall\":\"\",\"park\":\"\",\"metro\":\"\",\"Near_by_school\":\"Little Champ Gurukulam Pre School \\\/ 1.52 km\",\"Near_by_hospital\":\"Suresh Hospital \\\/ 2.16 km\",\"Near_by_mall\":\"LORVEN LEO \\\/ 2.13 km\",\"Near_by_park\":\"SURYA ENCLAIVE \\\/ 2.09 km\"},\"city\":\"Bangalore\",\"locality\":\"Battarahalli\",\"token\":\"344bd4f0fab99b460873cfff6befb12f\"}"}
I created the table like this
CREATE EXTERNAL TABLE orc_test (json string)
LOCATION '/user/ec2-user/test_orc';
If I use this query, it works for me.
select * from orc_test;
If I try to select one column, it prints NULL:
select get_json_object(orc_test.json,'$.locality') as loc
from orc_test;
It prints
NULL NULL NULL
Any help will be appreciated.

In your case, I think the backslashes in your data are causing the problem, along with the quotes surrounding your page data. I have listed the updated data below; you could save it to a file and load it into the table, and then your query should work.
{"page_1":{"city":"Bangalore","locality":"Battarahalli","Name_of_Person":"xxx","User_email_address":"test#gmail.com","user_phone_number":"","sub_locality":"","street_name":"7th Cross Road, Near Reliance Fresh, T.c Palya,","home_plot_no":"45","pin_code":"560049","project_society_build_name":"Sunshine Layout","landmark_reference_1":"","landmark_reference_2":"","No_of_Schools":20,"No_of_Hospitals":20,"No_of_Metro":0,"No_of_Mall":11,"No_of_Park":10,"Distance_of_schools":1.55,"Distance_of_Hospitals":2.29,"Distance_of_Metro":0,"Distance_of_Mall":1.55,"Distance_of_Park":2.01,"lat":13.0243273,"lng":77.7077906,"ipinfo":{"ip":"113.193.30.130","hostname":"No Hostname","city":"","region":"","country":"IN","loc":"20.0000,77.0000","org":"AS45528 Tikona Digital Networks Pvt Ltd."}},"page_2":{"home_type":"Flat","area":"1350","beds":"3 BHK","bath_rooms":2,"building_age":"1","floors":2,"balcony":2,"amenities":"premium","amenities_options":{"gated_security":"","physical_security":"","cctv_camera":"","controll_access":"","elevator":true,"power_back_up":"","parking":true,"partial_parking":"","onsite_maintenance_store":"","open_garden":"","party_lawn":"","amenities_balcony":"","club_house":"","fitness_center":"","swimming_pool":"","party_hall":"","tennis_court":"","basket_ball_court":"","squash_coutry":"","amphi_theatre":"","business_center":"","jogging_track":"","convinience_store":"","guest_rooms":""},"interior":"regular","interior_options":{"tiles":true,"marble":"","wooden":"","modular_kitchen":"","partial_modular_kitchen":"","gas_pipe":"","intercom_system":"","air_conditioning":"","partial_air_conditioning":"","wardrobe":"","sanitation_fixtures":"","false_ceiling":"","partial_false_ceiling":"","recessed_lighting":""},"location":"regular","location_options":{"good_view":true,"transporation_hub":true,"shopping_center":"","hospital":"","school":"","ample_parking":"","park":"","temple":"","bank":"","less_congestion":"","less_pollution":""},"maintenance":"","maintenance_value":"","near_by":{"school":"","hospital":"","mall":"","park":"","metro":"","Near_by_school":"Little Champ Gurukulam Pre School / 1.52 km","Near_by_hospital":"Suresh Hospital / 2.16 km","Near_by_mall":"LORVEN LEO / 2.13 km","Near_by_park":"SURYA ENCLAIVE / 2.09 km"},"city":"Bangalore","locality":"Battarahalli","token":"344bd4f0fab99b460873cfff6befb12f"}}
I tried this and it works for me.
hive> select get_json_object(orc_test.json,'$.page_1.locality') as loc from orc_test;
OK
Battarahalli
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive> select get_json_object(orc_test.json,'$.page_1.city') as loc from orc_test;
OK
Bangalore
Time taken: 0.097 seconds, Fetched: 1 row(s)
hive> select get_json_object(orc_test.json,'$.page_2.home_type') as loc from orc_test;
OK
Flat
Time taken: 0.091 seconds, Fetched: 1 row(s)

It seems that you have not created the table with multiple columns; there is only one column in the Hive table.
Hive is treating the whole JSON document as a single string value for that column, which is why selecting individual columns shows NULL.
Use a JSON SerDe so that Hive can map your JSON fields to the columns in your table.
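For example, a minimal sketch of such a table, assuming the one-record-per-line JSON from the answer above (with page_1/page_2 as real nested objects) and that the HCatalog JsonSerDe jar is available on the classpath; the table name and the trimmed-down column list are only for illustration:
-- Hypothetical DDL: only a few of the JSON fields are declared; undeclared keys are expected to be ignored.
CREATE EXTERNAL TABLE page_data (
  page_1 STRUCT<city:STRING, locality:STRING, pin_code:STRING>,
  page_2 STRUCT<home_type:STRING, area:STRING, beds:STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/ec2-user/test_orc';

-- Columns can then be selected directly instead of via get_json_object:
SELECT page_1.locality, page_2.home_type FROM page_data;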

Besides vmachan's answer, which I think is right, the problem I encountered in a similar situation was that the JSON records weren't placed on separate lines. It also didn't work when the data was a single JSON array. So, for example, this worked fine with Hive 3.1.0 using lateral view/json_tuple (a query sketch follows the two examples below):
{"color":"black","category":"hue","type":"primary","code":{"rgba":[255,255,255,1],"hex":"#000"}}
{"color":"white","category":"value","code":{"rgba":[0,0,0,1],"hex":"#FFF"}}
{"color":"red","category":"hue","type":"primary","code":{"rgba":[255,0,0,1],"hex":"#FF0"}}
{"color":"blue","category":"hue","type":"primary","code":{"rgba":[0,0,255,1],"hex":"#00F"}}
{"color":"yellow","category":"hue","type":"primary","code":{"rgba":[255,255,0,1],"hex":"#FF0"}}
{"color":"green","category":"hue","type":"secondary","code":{"rgba":[0,255,0,1],"hex":"#0F0"}}
and this did NOT work (the whole file is one JSON array):
[{"color":"black","category":"hue","type":"primary","code":{"rgba":[255,255,255,1],"hex":"#000"}},
{"color":"white","category":"value","code":{"rgba":[0,0,0,1],"hex":"#FFF"}},
{"color":"red","category":"hue","type":"primary","code":{"rgba":[255,0,0,1],"hex":"#FF0"}},
{"color":"blue","category":"hue","type":"primary","code":{"rgba":[0,0,255,1],"hex":"#00F"}},
{"color":"yellow","category":"hue","type":"primary","code":{"rgba":[255,255,0,1],"hex":"#FF0"}},
{"color":"green","category":"hue","type":"secondary","code":{"rgba":[0,255,0,1],"hex":"#0F0"}}]

Related

Import massive table from Oracle to PostgreSQL with oracle-fdw returns ORA-01406

I am working on a project that transfers data from an Oracle database to a PostgreSQL database to build a data warehouse with bash & SQL scripts. To access the Oracle database, I use the PostgreSQL extension oracle-fdw.
One of my scripts imports data from a massive table (~100,000,000 new rows/day). This table is partitioned and each partition contains one day of data. The query I use to import the data looks like this:
INSERT INTO postgre_target_table (some_fields)
SELECT some_aggregated_fields -- (~150 fields)
FROM oracle_source_table
WHERE partition_id = :v_partition_id AND some_others_filters
GROUP BY primary_key;
On the DEV server the query works fine (there is much less data on that server), but in PREPROD it returns the error ORA-01406: fetched column value was truncated.
In some posts, people say that the output fields may be too small, but if I run a simple SELECT query without the INSERT or GROUP BY I get the same error.
Another idea I found in another post is to create an Oracle-side view, but my query uses multiple parameters that I cannot use in a view.
The last idea I found is to create an Oracle stored procedure that fills a table with aggregated data and then import from that table, but the Oracle database is critical and my customer prefers to avoid adding more data to it.
Now, I'm starting to think there's no solution and it's not good...
PostgreSQL version : 12.4 / Oracle version : 11.2
UPDATE
It seems my problem is more complicated than I thought.
After applying the modification given by Laurenz Albe, the query runs correctly in pgAdmin, but the problem still appears when I use the psql command.
Moreover, another query seems to have the same problem. This other query does not use the same source table as the first one; it joins 4 tables without any partitioning. The common point between these queries is their structure.
The detail I omitted in the original post is that the purpose of both queries is to pivot a table. They look like this:
SELECT osr.id,
       MIN(CASE osr.category
             WHEN 123 THEN 1
           END) AS field1,
       MIN(CASE osr.category
             WHEN 264 THEN 1
           END) AS field2,
       MIN(CASE osr.category
             WHEN 975 THEN 1
           END) AS field3,
       ...
FROM oracle_source_table osr
WHERE osr.category IN (123, 264, 975, ...)
GROUP BY osr.id;
Now that I have detailed what the queries look like, here are some results I got with the second one without changing the value of max_long (this query is lighter than the first one):
Sometimes it works (~10%) and sometimes it fails (~90%) in pgAdmin, but it never works with the psql command.
If I delete the WHERE clause, it always works.
I don't understand why deleting the WHERE clause changes anything; the field used in this clause is a NUMBER(6, 0) between 0 and 2500, and it is still used in the SELECT clause... Oh, and in the 4 Oracle tables used by this query there is no LONG datatype; only the NUMBER datatype is used.
Among the 20 queries I have, only these two have a problem; their structure is similar and I don't believe in coincidences.
Don't despair!
Set the max_long option on the foreign table big enough that all your oversized data fit.
The documentation has the details:
max_long (optional, defaults to "32767")
The maximal length of any LONG, LONG RAW and XMLTYPE columns in the Oracle table. Possible values are integers between 1 and 1073741823 (the maximal size of a bytea in PostgreSQL). This amount of memory will be allocated at least twice, so large values will consume a lot of memory.
If max_long is less than the length of the longest value retrieved, you will receive the error message
ORA-01406: fetched column value was truncated
Example:
ALTER FOREIGN TABLE my_tab OPTIONS (ADD max_long '1000000');
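Note that ADD works only the first time; if max_long has already been set on the foreign table, the standard ALTER FOREIGN TABLE ... OPTIONS syntax uses SET to change it:
-- Change an option that already exists on the foreign table:
ALTER FOREIGN TABLE my_tab OPTIONS (SET max_long '1000000');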

In Hive the query status returns OK but no records are shown.

select * from mca_info;
OK
Time taken: 0.934 seconds
That is all the output that is shown.
But earlier that query worked.
Oh... I found it.
The data load from the HDFS file had not actually succeeded, so the table was empty.
When I manually inserted data into mca_info using an "insert into table...." query, "select * from mca_info" worked and the data was shown.
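For anyone hitting the same thing, a quick sanity check and reload might look like this (a sketch; the HDFS path is an assumption):
-- Confirm whether the table actually contains rows:
SELECT COUNT(*) FROM mca_info;

-- Re-run the load; LOAD DATA INPATH moves the file from that HDFS location into the table's directory:
LOAD DATA INPATH '/user/hadoop/mca_info.txt' INTO TABLE mca_info;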

hive not taking values

I am trying to import a file into Hive.
The sample data is as follows:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
My table declaration is as follows:
create table movies(id int,title string,genre string) row format delimited fields terminated by '::';
But after loading the data, my table shows data for the first two fields only.
Total MapReduce CPU Time Spent: 1 seconds 600 msec
OK
1 Toy Story (1995)
2 Jumanji (1995)
3 Grumpier Old Men (1995)
4 Waiting to Exhale (1995)
Time taken: 22.087 seconds
Can anyone help me understand why this is happening or how to debug it?
The Hive field delimiter takes only one character by default. Since your delimiter has two characters ('::'), please try creating the table with MultiDelimitSerDe.
Query:
CREATE TABLE movies (id int,title string,genre string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="::")
STORED AS TEXTFILE;
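After creating the table, load the sample file and query it (the local path below is only an assumption for illustration):
-- Hypothetical local path to the sample file shown above:
LOAD DATA LOCAL INPATH '/tmp/movies.dat' INTO TABLE movies;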
Output:
hive> select * from movies;
OK
1 Toy Story (1995) Animation|Children's|Comedy
2 Jumanji (1995) Adventure|Children's|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama
Please refer to this similar post:
Load data into Hive with custom delimiter

java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hbase.client.Result cannot be cast to org.apache.hadoop.io.Writable

I tried a sample of accessing an HBase table from Hive.
The CREATE EXTERNAL TABLE command was successful, but the SELECT statement gives a class cast exception.
ENV:
Hive 0.12.0, HBase 0.96.1, Hadoop 2.2, Ubuntu 12.04 on VirtualBox
hive> SHOW TABLES;
OK
hbatablese_myhive
Time taken: 0.309 seconds, Fetched: 1 row(s)
hive> SELECT * FROM hbatablese_myhive;
OK
Failed with exception
java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hbase.client.Result cannot be cast to org.apache.hadoop.io.Writable
Time taken: 1.179 seconds
hive>
The same table on the HBase console:
hbase(main):002:0> scan 'myhive'
ROW COLUMN+CELL
row1 column=ratings:userid, timestamp=1392886585074, value=user1
row2 column=ratings:userid, timestamp=1392886606457, value=user2
2 row(s) in 0.0520 seconds
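For context, a Hive-to-HBase mapping for a table like this typically looks something like the following (a sketch, not the asker's exact DDL; the Hive column names are guessed from the scan output):
-- Maps the HBase row key and the ratings:userid column into Hive:
CREATE EXTERNAL TABLE hbatablese_myhive (key string, userid string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,ratings:userid")
TBLPROPERTIES ("hbase.table.name" = "myhive");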
There used to be a Writables.copyWritable(Result result, Result value) call in an old version of the next(ImmutableBytesWritable key, Result value) method of TableRecordReaderImpl.java.
copyWritable has since been removed and only works with (Writable, Writable) arguments.
To do a copy now, you need to use value.copyFrom(result), which does a deep copy of the data from source to destination.
I am guessing you have some library mismatch that is making these calls happen and trying to cast from (Result, Result) to (Writable, Writable).

Adobe Air SQLite performance vs SQLite cli

I have a simple SQL table structure with three tables: entry (~70k rows), tag (~27k rows) and entry_tag (~2.5M rows). The entry_tag table contains the relations between entries and tags (an n-to-n relation), like this: entry_tag (entry_id, tag_id, ...).
When I first tried to fetch all the entry_ids with a particular tag, I wrote a query like:
SELECT entry_id FROM entry_tag et, tag t
WHERE t.id=et.tag_id AND t.name LIKE 'Something';
For no apparent reason the query took over 6 seconds in the Adobe AIR app on Win7 x64 (i5-2500K, 8 GB RAM, 7200 rpm HDD), which is way too slow (particularly with more complex INTERSECT queries). In the SQLite command-line tool the query returned results immediately. I compared EXPLAIN and EXPLAIN QUERY PLAN between the CLI and AIR (using the AIR-based SQLite Sorcerer) and the results were identical. Both queries were using indices.
After I restructured the query by switching the order of the tables, the time in AIR dropped to 250 ms, which is satisfactory:
SELECT entry_id FROM tag t, entry_tag et
WHERE et.tag_id=t.id AND t.name LIKE 'Something';
So I was left wondering why the CLI tool was so much faster than the Adobe AIR environment. For background information, I generated the DB file with PHP (all columns INTEGER or TEXT).
Here is the query plan for the original slow query:
selectid  order  from  detail
--------  -----  ----  ------
0         0      0     SCAN TABLE entry_tag AS et USING INDEX idx_et_eid (~1000000 rows)
0         1      1     SEARCH TABLE tag AS t USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
(~1000000 rows looks bad anyway ;)
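One way to pin the faster join order regardless of what the planner estimates (a sketch, not something verified in the AIR environment) is SQLite's CROSS JOIN, which prevents the optimizer from reordering the two tables:
-- In SQLite, CROSS JOIN forces the left-hand table (tag) to be the outer loop,
-- so entry_tag is probed through its index instead of being scanned:
SELECT et.entry_id
FROM tag t CROSS JOIN entry_tag et
WHERE et.tag_id = t.id AND t.name LIKE 'Something';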
