Creating sql statements to return information from a table - datatable

I am writing SQL queries to return information from a table, but I am having trouble with one in particular. I want to return all of the urban areas that are in the state of Colorado.
The actual definition of the query is
Return the names (name10) of all urban areas (in alphabetical order) that are entirely contained
within Colorado. Return the results in alphabetical order. (64 records)
The tables that I am using are tl_2010_us_state10 (this stores information for the states). I think I am going to use the name10 variable in this table because that has all of the names of the states.
Table "public.tl_2010_us_state10"
Column | Type | Modifiers
------------+-----------------------------+-------------------------------------
gid | integer | not null default
region10 | character varying(2) |
division10 | character varying(2) |
statefp10 | character varying(2) |
statens10 | character varying(8) |
geoid10 | character varying(2) |
stusps10 | character varying(2) |
name10 | character varying(100) |
Then I have a table that displays all the urban information. Once again I think I am going to use the name10 variable because it stores the name of all the urban areas.
Table "public.tl_2010_us_uac10"
Column | Type | Modifiers
------------+-----------------------------+-------------------------------------
gid | integer | not null default
uace10 | character varying(5) |
geoid10 | character varying(5) |
name10 | character varying(100) |
The query that I wrote was
select a.name10 from tl_2010_us_uac10 as a join tl_2010_us_state10 as b where (b.name10 = 'colorado');
but I get this error
LINE 1: ...l_2010_us_uac10 as a join tl_2010_us_state10 as b where (b.n...
gid is a primary key

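An inner join must have a join condition (an ON clause). You also need an ORDER BY to meet your sorting requirement. Be aware that gid is each table's own primary key, so the join on gid below only makes the statement valid; limiting the result to urban areas entirely contained within Colorado presumably needs a spatial condition on geometry columns that are not shown in the table listings.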
select a.name10 as urban_area
from tl_2010_us_uac10 as a
join tl_2010_us_state10 as b
on b.gid = a.gid
where b.name10 = 'colorado'
order by a.name10;

Related

TABLE ACCESS FULL in Oracle execution plan

I have been tasked with working out the SELECT statement behind this explain plan:
------------------------------------------
| Id | Operation | Name |
------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN RIGHT ANTI | |
| 2 | VIEW | VW_NSO_1 |
| 3 | HASH JOIN RIGHT SEMI| |
| 4 | TABLE ACCESS FULL | PART |
| 5 | TABLE ACCESS FULL | ORDERS |
| 6 | TABLE ACCESS FULL | CUSTOMER |
------------------------------------------
I have been able to work out the SELECT statement for Ids 0 to 5, but what does line 6 mean?
This is what I have managed to figure out so far:
select *
from customer c join orders o
on c.custkey = o.custkey
where o_totalprice
not in
(select p_retailprice
from part p join orders o
on orders.o_custkey >= 0 and 0.1*o_totalprice >= 0)
I can't see where the last line of the plan (Id 6) comes into play.
Your query is:
select *
from customer c join orders o
on c.custkey = o.custkey
where o_totalprice
not in
(select p_retailprice
from part p join orders o
on orders.o_custkey >= 0 and 0.1*o_totalprice >= 0)
And your explain plan is
------------------------------------------
| Id | Operation | Name |
------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN RIGHT ANTI | |
| 2 | VIEW | VW_NSO_1 |
| 3 | HASH JOIN RIGHT SEMI| |
| 4 | TABLE ACCESS FULL | PART |
| 5 | TABLE ACCESS FULL | ORDERS |
| 6 | TABLE ACCESS FULL | CUSTOMER |
------------------------------------------
In your case, this is what happens:
You are getting all the records from both customer and orders that match on the custkey field.
The predicate restricts the output to rows where o_totalprice (for readability you should qualify which table this column comes from, although I guess it is orders) is not in the dataset returned by the subquery.
The subquery returns all values of p_retailprice produced by the join between part and orders on orders.o_custkey >= 0 and 0.1*o_totalprice >= 0.
Taking this into consideration, the CBO is:
Accessing the CUSTOMER table (line 6) with a TABLE ACCESS FULL, which is logical because you are selecting all columns from the table and probably have no index on custkey.
Performing a HASH JOIN RIGHT SEMI (line 3) between PART and ORDERS. In general, a semi join is used for an IN or EXISTS clause, and the join stops probing for a given row as soon as the IN or EXISTS condition is satisfied.
Performing the HASH JOIN RIGHT ANTI in line 1, which appears when the optimizer pushes the join predicate into a view, normally when an anti join (NOT IN) is in place. The result is then joined to the CUSTOMER table in line 6.
You are filtering only on the right table of the join (ORDERS), which is why the access paths look the way they do.
This is just an overview of your execution plan and of the reasons why the CBO is using those access paths.
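To make the semi join and anti join steps more concrete, here is a minimal Python sketch of the two operations. It is only an illustration of the general idea, not Oracle's implementation: the function names and the sample rows are hypothetical, and the NULL handling that NOT IN requires is ignored.

```python
def hash_semi_join(probe_rows, build_rows, probe_key, build_key):
    """Return each probe row that has at least one match on the build side.

    A probe row is emitted at most once, however many build rows match it.
    """
    build_keys = {build_key(row) for row in build_rows}  # build phase: hash the keys
    return [row for row in probe_rows if probe_key(row) in build_keys]


def hash_anti_join(probe_rows, build_rows, probe_key, build_key):
    """Return each probe row that has no match on the build side."""
    build_keys = {build_key(row) for row in build_rows}
    return [row for row in probe_rows if probe_key(row) not in build_keys]


# Hypothetical sample data, loosely named after the tables in the plan.
orders = [
    {"o_orderkey": 1, "o_totalprice": 100.0},
    {"o_orderkey": 2, "o_totalprice": 250.0},
    {"o_orderkey": 3, "o_totalprice": 100.0},
]
parts = [
    {"p_partkey": 10, "p_retailprice": 100.0},
    {"p_partkey": 11, "p_retailprice": 100.0},  # duplicate price on purpose
]

semi = hash_semi_join(orders, parts,
                      lambda o: o["o_totalprice"], lambda p: p["p_retailprice"])
anti = hash_anti_join(orders, parts,
                      lambda o: o["o_totalprice"], lambda p: p["p_retailprice"])

print([o["o_orderkey"] for o in semi])  # orders whose price IS IN the part prices
print([o["o_orderkey"] for o in anti])  # orders whose price is NOT IN the part prices
```

Even though two parts share the price 100.0, each matching order appears only once in the semi join result, which is the difference from an ordinary inner join.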

How to count the number of words in each column delimited by a "|" separator using Hive?

The input data is
+----------------------+--------------------------------+
| movie_name | Genres |
+----------------------+--------------------------------+
| digimon | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror |
+----------------------+--------------------------------+
I need output like
+----------------------+--------------------------------+-----------------+
| movie_name | Genres | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon | Adventure|Animation|Children's | 3 |
| Slumber_Party_Massac | Horror | 1 |
+----------------------+--------------------------------+-----------------+
select *
,size(split(coalesce(Genres,''),'[^|\\s]+'))-1 as count_of_genres
from mytable
This solution covers several edge cases, including:
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation or Adventure| |Animation)
This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.
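The length()/replace() approach, together with the extra test for movies without a genre, can be tried out with a small Python sketch. It uses an in-memory SQLite database purely as a stand-in for Hive (an assumption made so the example is runnable), and the two rows without genres are hypothetical additions to the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (movie_name TEXT, genres TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("digimon", "Adventure|Animation|Children's"),
        ("Slumber_Party_Massac", "Horror"),
        ("no_genre_empty", ""),    # hypothetical row: empty string
        ("no_genre_null", None),   # hypothetical row: NULL
    ],
)

# Number of genres = number of '|' separators + 1,
# with an explicit guard for NULL or empty genre strings.
query = """
SELECT movie_name,
       CASE
           WHEN genres IS NULL OR genres = '' THEN 0
           ELSE 1 + length(genres) - length(replace(genres, '|', ''))
       END AS num_genres
FROM t
ORDER BY rowid
"""
genre_counts = conn.execute(query).fetchall()
for movie_name, num_genres in genre_counts:
    print(movie_name, num_genres)
```

Unlike the split-based query, this version does not handle empty tokens such as Adventure||Animation; each separator is counted as a genre boundary.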

Automatically generating documentation about the structure of the database

There is a database that contains several views and tables.
I need to create a report (documentation of the database) with a list of all the fields in these tables, indicating the type and, if possible, the minimum/maximum values and the values from the first row. For example:
.------------.--------.--------.--------------.--------------.--------------.
| Table name | Column | Type | MinValue | MaxValue | FirstRow |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | day | date | ‘2010-09-17’ | ‘2016-12-10’ | ‘2016-12-10’ |
:------------+--------+--------+--------------+--------------+--------------:
| Table1 | price | double | 1030.8 | 29485.7 | 6023.8 |
:------------+--------+--------+--------------+--------------+--------------:
| … | | | | | |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | day | date | ‘2014-06-20’ | ‘2016-11-28’ | ‘2016-11-16’ |
:------------+--------+--------+--------------+--------------+--------------:
| TableN | owner | string | NULL | NULL | ‘Joe’ |
'------------'--------'--------'--------------'--------------'--------------'
I think that running many queries like
SELECT MAX(column_name) as max_value, MIN(column_name) as min_value
FROM table_name
will be inefficient on the huge tables that are stored in Hadoop.
After reading the documentation, I found an article about "Statistics in Hive".
It seems I must use a statement like this:
ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;
But this command failed with an error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
Do I understand correctly that this statement adds information to the table's metadata rather than displaying a result? Will it work with a view?
Please suggest how to efficiently and automatically create documentation for a database in Hive.
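One way to limit the cost the question worries about is to compute MIN and MAX for every column in a single query per table, so each table is scanned once rather than once per column. The Python sketch below illustrates the idea under stated assumptions: an in-memory SQLite database stands in for Hive, the column list comes from SQLite's pragma_table_info (in Hive it would have to come from something like DESCRIBE or the metastore), and the helper names quote_ident and column_profile are hypothetical.

```python
import sqlite3


def quote_ident(name):
    """Quote an identifier so it can be embedded safely in SQL text."""
    return '"' + name.replace('"', '""') + '"'


def column_profile(conn, table):
    """Return (table, column, declared_type, min, max) tuples using one scan."""
    columns = conn.execute(
        "SELECT name, type FROM pragma_table_info(?)", (table,)
    ).fetchall()
    # Build one SELECT that aggregates every column at the same time.
    select_list = ", ".join(
        f"MIN({quote_ident(name)}), MAX({quote_ident(name)})" for name, _ in columns
    )
    stats = conn.execute(
        f"SELECT {select_list} FROM {quote_ident(table)}"
    ).fetchone()
    return [
        (table, name, col_type, stats[2 * i], stats[2 * i + 1])
        for i, (name, col_type) in enumerate(columns)
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (day TEXT, price REAL)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("2010-09-17", 6023.8), ("2016-12-10", 1030.8), ("2014-06-20", 29485.7)],
)

profile = column_profile(conn, "table1")
for row in profile:
    print(row)
```

The same generated SELECT could be issued against each table in turn to fill the MinValue and MaxValue columns of the report.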

MonetDB table statistics

Below is a portion of the statistics for one of my tables. I'm not sure how to interpret the width column. Are those values in bytes? If so, that seems off: I know fname and lname contain values longer than 5 and 6 ASCII characters, and mname contains some values that are 1 character long.
Update 1.
Below is the output of select * from statistics. I'm only showing the first 5 columns of the output.
+--------+---------+------------------------+---------+-------+
| schema | table | column | type | width |
+========+=========+========================+=========+=======+
| abc | targets | fname | varchar | 5 |
| abc | targets | mname | varchar | 0 |
| abc | targets | lname | varchar | 6 |
The column width shows the "byte-width of the atom array" (defined in gdk.h). This is however not the entire story in the case of string columns, because here the atom array only stores offsets into a string heap.
MonetDB uses variable-width offsets for these columns, because if there are few distinct string values, 64-bit offsets would be a waste of memory. So in your case, the fname column needs 5-byte (40-bit) string offsets and lname needs 6-byte (48-bit) offsets. This could change if new values are inserted.
The zero value for mname is interesting, because the width is initialised to 1 for new columns. Which version are you using?

How To Parse a String (From a different Table) in Hive (Hadoop) And Load It To a Different Table

I have this Table as an Input:
Table Name:Deals
Columns: Doc_id(BIGINT),Nv_Pairs_Feed(STRING),Nv_Pairs_Category(STRING)
For Example:
Doc_id: 4997143658422483637
Nv_Pairs_Feed: "TYPE:Wiper Blade;CONDITION:New;CATEGORY:Auto Parts and Accessories;STOCK_AVAILABILITY:Y;ORIGINAL_PRICE:0.00"
Nv_Pairs_Category: "Condition:New;Store:PartsGeek.com;"
I am trying to parse the fields "Nv_Pairs_Feed" and "Nv_Pairs_Category" and extract their N:V pairs (pairs are separated by ';', and each name is separated from its value by ':').
My goal is to insert each N:V as a Row in this table:
Doc_id | Name | Value | Source_Field
Example for desired Result:
4997143658422483637 | Condition | New | Nv_Pairs_Category
4997143658422483637 | Store | PartsGeek.com | Nv_Pairs_Category
4997143658422483637 | TYPE | Wiper Blade | Nv_Pairs_Feed
4997143658422483637 | CONDITION | New | Nv_Pairs_Feed
4997143658422483637 | CATEGORY | Auto Parts and Accessories | Nv_Pairs_Feed
4997143658422483637 | STOCK_AVAILABILITY | Y | Nv_Pairs_Feed
4997143658422483637 | ORIGINAL_PRICE | 0.00 | Nv_Pairs_Feed
You can convert the strings to maps using the standard Hive UDF str_to_map, and then use the Brickhouse UDFs ( http://github.com/klout/brickhouse ) map_key_values, combine and numeric_range to explode those maps, i.e. something like the following:
create view deals_map_view as
select doc_id,
       map_key_values(
         combine( str_to_map( nv_pairs_feed, ';', ':'),
                  str_to_map( nv_pairs_category, ';', ':'))) as deals_map_key_values
from deals;
select
  doc_id,
  array_index( deals_map_key_values, i ).key as name,
  array_index( deals_map_key_values, i ).value as value
from deals_map_view
lateral view numeric_range( size( deals_map_key_values ) ) i1 as i;
You can probably do something similar with an explode_map UDF.
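For reference, the parsing logic itself (split on ';', then split each pair on the first ':') can be sketched in plain Python. This is not the Hive solution, only an illustration of the expected rows; the function name explode_nv_pairs is hypothetical, and skipping the empty token left by a trailing ';' is an assumption about the desired behaviour.

```python
def explode_nv_pairs(doc_id, text, source_field):
    """Turn 'N1:V1;N2:V2;...' into (doc_id, name, value, source_field) rows."""
    rows = []
    for pair in text.split(";"):
        if not pair.strip():
            continue  # empty token, e.g. after a trailing ';'
        name, separator, value = pair.partition(":")
        if not separator:
            continue  # no ':' found, so this is not a name:value pair
        rows.append((doc_id, name.strip(), value.strip(), source_field))
    return rows


# The example record from the question.
deal = {
    "Doc_id": 4997143658422483637,
    "Nv_Pairs_Feed": "TYPE:Wiper Blade;CONDITION:New;CATEGORY:Auto Parts and Accessories;"
                     "STOCK_AVAILABILITY:Y;ORIGINAL_PRICE:0.00",
    "Nv_Pairs_Category": "Condition:New;Store:PartsGeek.com;",
}

rows = []
for field in ("Nv_Pairs_Category", "Nv_Pairs_Feed"):
    rows.extend(explode_nv_pairs(deal["Doc_id"], deal[field], field))

for row in rows:
    print(" | ".join(str(column) for column in row))
```

Because partition splits on the first ':' only, a value that itself contains a colon is kept intact.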

Resources