Sampling Issue with hive - hadoop

"all_members" is a table in hive with 10m rows and 1 column: "membership_nbr". I want to sample 3000 rows. This is what I have done:
hive>create table sample_members as select * from all_members limit 1;
hive>insert overwrite table sample_members select membership_nbr from all_members tablesample(3000 rows);
hive>select count(*) from sample_members;
OK 45000
The result won't change if I replace 3000 rows with 300 rows.
Am I doing something wrong?

Table sampling with tablesample(3000 rows) won't fetch 3000 rows from the entire table; instead, it fetches 3000 rows from each input split.
Your query apparently ran 15 mappers, so each mapper fetched 3000 rows, for a total of 3000 * 15 = 45000 rows. Likewise, if you change 3000 rows to 300 rows you get 300 * 15 = 4500 rows after sampling.
So, for your requirement, use tablesample(200 rows): each mapper then fetches 200 rows, and the 15 mappers together produce the 3000 sampled rows.
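For example, keeping your table names and assuming the job still runs 15 mappers, the insert becomes:
hive>insert overwrite table sample_members select membership_nbr from all_members tablesample(200 rows);
After that, select count(*) from sample_members should return 3000.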
Refer to the link below for the various types of sampling:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

Related

Clickhouse query with a LIMIT clause inefficiently reads too many rows

I'm querying ClickHouse with a query that has ORDER BY and LIMIT 1, and the ORDER BY matches the table's sort order. The query returns 1 row as expected; however, 50+ rows were scanned to return the result.
I would expect ClickHouse to scan only 1 row as the ORDER BY is in the table's sort order. What's happening here and what can I do to fix this?
SELECT * FROM comp_intel_scrapes
order by
client_slug,
client_hotel_id,
argset_id,
scrape_datetime,
preferred_country,
preferred_currency,
adults,
children,
nights,
min_checkin_date,
max_checkin_date
limit 1
----
Elapsed: 0.004s
Read: 54 rows (8.84KB)
By the way, ClickHouse Cloud is being used here.
It depends on the table engine.
The primary index is sparse: https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-design/
Because of this, ClickHouse cannot read less than one granule, which is ~8192 rows.
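As an illustration only (the column types below are guessed from the query, not taken from your actual schema), the granule size is the index_granularity setting of a MergeTree table, which defaults to 8192:
CREATE TABLE comp_intel_scrapes_example
(
    client_slug String,
    client_hotel_id UInt64,
    argset_id UInt64,
    scrape_datetime DateTime
    -- remaining sort-key columns omitted
)
ENGINE = MergeTree
ORDER BY (client_slug, client_hotel_id, argset_id, scrape_datetime)
SETTINGS index_granularity = 8192;
With an engine definition like this, reads are resolved in granule-sized units, which is why even a LIMIT 1 query reports more than one row read.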

Talend : How to fix the code of method is exceeding the 65535 byte limit

I have a set of 5 tables, each with approximately 2 million rows and 450 columns.
My job looks like this:
tDBInput 1 ---tMap-----
tDBInput 1 ---tMap-----
tDBInput 1 ---tMap---tUnite---tDBOutput
tDBInput 1 ---tMap-----
tDBInput 1 ---tMap-----
These are my 5 tables that I'm trying to union. In each tMap I add an Id to trace which table the data comes from, and I reduce the number of columns (from 450 to 20).
Then I unite the 5 flows in one tUnite, which loads a table in Truncate - Insert mode.
I'm trying to make it work but always get the same error: "The code of method tDBInput-10Process is exceeding the 65535 bytes limit"
If you use only 20 of 450 columns, you could select only those columns in each of your tDBInput, instead of extracting all columns and filtering them in tMap.
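For instance (the column names here are hypothetical, since the real schema isn't shown), each tDBInput query would list only the ~20 columns you actually keep instead of selecting all 450:
SELECT source_id, order_id, customer_id, amount, created_at   -- ...and the rest of the ~20 columns you need
FROM source_table_1;
Pulling fewer columns through each subjob also shrinks the Java code Talend generates for it, which is presumably what is pushing the tDBInput-10Process method past the 65535-byte limit.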

How to obtain a random sample of 100K users with all their transactions in hive?

I have a huge dataset containing information on millions of users and their purchases recorded for 1 year. Is there a way to create a random sample of 100K users (keeping all their individual purchases) from this data? Since a user can have more than one purchase, the sample will contain more than 100k records.
I was able to find the rand() function but it does not give me all records for the users.
I tried this query:
select *
from mytable
where rand()< 0.025 and mydate between '20140101' and '20141231'
distribute by rand()
sort by rand()
limit 100000
This query produces only 100k random records, not all the records for those 100k users.
Any suggestions on how to write a Hive query to obtain this result?
You should create a table of 100,000 random userIDs first:
CREATE table Random_Users AS
Select * From (Select distinct userId From mytable) users
where rand() < 0.025 limit 100000;
Then you can do
Select m.* From mytable m JOIN random_users r ON (m.userId = r.userId);
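If you want the 100,000 ids drawn uniformly rather than simply taking the first ids the limit happens to cut off, a variant of the first step (same hypothetical userId column, reusing the distribute by rand() / sort by rand() pattern from your own attempt) is:
CREATE TABLE Random_Users AS
SELECT userId
FROM (SELECT DISTINCT userId FROM mytable) u
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 100000;
The join in the second step stays the same.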

Split amount into multiple rows if amount>=$10M or <=$-10B

I have a table in an Oracle database which may contain amounts >= $10M or <= -$10B.
If the value is greater than or equal to $10M, I need to break it into one or more 9999999.99 chunks and also include the remainder.
If the value is less than or equal to -$10B, I need to break it into one or more 999999999.99 chunks and also include the remainder.
Your question is somewhat unreadable and you did not provide examples, but here is something for a start, which may help you or someone with a similar problem.
Let's say you have this data and you want to divide amounts into chunks not greater than 999:
id amount
-- ------
1 1500
2 800
3 2500
This query:
select id, amount,
case when level=floor(amount/999)+1 then mod(amount, 999) else 999 end chunk
from data
connect by level<=floor(amount/999)+1
and prior id = id and prior dbms_random.value is not null
...divides the amounts; the last row for each id contains the remainder. Output is:
ID AMOUNT CHUNK
------ ---------- ----------
1 1500 999
1 1500 501
2 800 800
3 2500 999
3 2500 999
3 2500 502
SQLFiddle demo
Edit: full query according to additional explanations:
select id, amount,
case
when amount>=0 and level=floor(amount/9999999.99)+1 then mod(amount, 9999999.99)
when amount>=0 then 9999999.99
when level=floor(-amount/999999999.99)+1 then -mod(-amount, 999999999.99)
else -999999999.99
end chunk
from data
connect by ((amount>=0 and level<=floor(amount/9999999.99)+1)
or (amount<0 and level<=floor(-amount/999999999.99)+1))
and prior id = id and prior dbms_random.value is not null
SQLFiddle
Please adjust numbers for positive and negative borders (9999999.99 and 999999999.99) according to your needs.
There are more possible solutions (a recursive CTE query, a PL/SQL procedure, maybe others); this hierarchical query is just one of them.
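For completeness, here is a minimal sketch of the recursive CTE alternative for the positive case, using the same simplified 999 chunk size and the same data table as above (the real positive/negative borders would be substituted just as in the hierarchical query):
with chunks (id, amount, remaining, chunk) as (
  select id, amount, amount, least(amount, 999) from data
  union all
  select id, amount, remaining - 999, least(remaining - 999, 999)
  from chunks
  where remaining > 999
)
select id, amount, chunk from chunks order by id;
Each pass peels one 999 chunk off the remaining amount; the final row per id carries the remainder.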

Performance issue in hive version 0.13.1

I use AWS-EMR to run my Hive queries and I have a performance issue while running hive version 0.13.1.
The newer version of Hive takes around 5 minutes to run over 10 rows of data, but the same script over 230804 rows has been running for 2 days and still hasn't finished. What should I do to analyze and fix the problem?
Sample Data:
Table 1:
hive> describe foo;
OK
orderno string
Time taken: 0.101 seconds, Fetched: 1 row(s)
Sample data for table1:
hive>select * from foo;
OK
1826203307
1826207803
1826179498
1826179657
Table 2:
hive> describe de_geo_ip_logs;
OK
id bigint
startorderno bigint
endorderno bigint
itemcode int
Time taken: 0.047 seconds, Fetched: 4 row(s)
Sample data for Table 2:
hive> select * from bar;
127698025 417880320 417880575 306
127698025 3038626048 3038626303 584
127698025 3038626304 3038626431 269
127698025 3038626560 3038626815 163
My Query:
SELECT b.itemcode
FROM foo a, bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
At the very top of your Hive log output, it states "Warning: Shuffle Join JOIN[4][Tables a, b] in Stage 'Stage-1 Mapred' is a cross product."
EDIT:
A 'cross product' or Cartesian product is a join without conditions: every row of table 'a' is paired with every row of table 'b'. So if 'a' has 5 rows and 'b' has 10 rows, you get the product, 5 multiplied by 10 = 50 rows returned. Most of those combinations will never satisfy your BETWEEN condition, but they all have to be produced and checked.
Now, if you have a table 'a' of 20,000 rows and join it to another table 'b' of 500,000 rows, you are asking the SQL engine to produce a data set 'a, b' of 10,000,000,000 rows, and then perform the BETWEEN operation on those 10 billion rows.
So, reducing the number of 'b' rows benefits you more than reducing 'a'. In your example, if you can filter the ip_logs table (table 2), which I'm guessing has more rows than your order-number table, it will cut down on the execution time.
END EDIT
You're forcing the execution engine to work through a Cartesian product by not specifying a condition for the join. It's having to scan all of table a over and over. With 10 rows, you will not have a problem. With 20k, you are running into dozens of map/reduce waves.
Try this query:
SELECT b.itemcode
FROM foo a JOIN bar b on <SomeKey>
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
But I'm having trouble figuring out what column your model will allow joining on. Maybe the data model for this expression could be improved? It may just be me not reading the sample clearly.
Either way, you need to filter the number of comparisons BEFORE the where clause. Other ways I have done this in Hive are to make a view with a smaller set of data and join/match the view instead of the original table.
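As a rough sketch only (the filter bounds below are made up, and orderno is a string in foo so it may need an explicit cast), the view approach would look something like:
CREATE VIEW bar_small AS
SELECT startorderno, endorderno, itemcode
FROM bar
WHERE endorderno >= 1826000000      -- assumed lower bound of foo.orderno
  AND startorderno <= 1827000000;   -- assumed upper bound of foo.orderno

SELECT b.itemcode
FROM foo a, bar_small b
WHERE cast(a.orderno as bigint) BETWEEN b.startorderno AND b.endorderno;
The range comparison is still a cross product, but against a far smaller table, which is where the time savings come from.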
