Impala query returns data in random order - hadoop

I would like my select * query of a table to return rows in the same order as they are stored in the DB. However, it returns the data in a random order. When I execute the same query in Hive, I get the dataset in the correct order. Is there a way to make Impala return the result set in the same order as it is stored in the DB?

Without ORDER BY, the order of rows returned by a query is not defined. Due to parallel and distributed execution, the order may vary from run to run: some processes finish faster, some wait in the queue, and all of them emit data independently of each other.
Also, according to classic Codd relational theory, the order of rows in a table and the order of columns are immaterial to the database. You can sort data during insert into the table, and sorted data compresses much better and lets internal indexes and bloom filters work better, but the order of rows in the returned dataset is still not guaranteed without ORDER BY. The same applies to Hive: in some cases, when a single mapper has started and there are no reducers, the data will be returned in the same order as it appears in the data file, but do not rely on it; add ORDER BY if you need ordering.
Only single-threaded processing can return data in its original order, and that will kill performance. Better to redesign your data flow and add an ordering column so that rows can be ordered during select in a distributed environment.
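As a minimal sketch of that last suggestion (the table name and the load_seq column are hypothetical), the load process could stamp each row with an ingestion sequence so the original order can be restored explicitly:
-- hypothetical ordering column, populated by the load process
ALTER TABLE my_table ADD COLUMNS (load_seq BIGINT);
-- restore the original load order explicitly
SELECT *
FROM my_table
ORDER BY load_seq;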

Related

Snowflake queries with CTEs seem not to cache results

When I execute a query containing a CTE (common table expression defined by a WITH clause) in Snowflake, the result is not cached.
The question is: is this how Snowflake works as designed, or do I need to do something to force result caching?
Snowflake does use the result set cache for CTEs. You can confirm that by running this simple one twice. It should show in the history table that the second one did not use a warehouse to run. Drilling down into the query profile should show the second one's execution plan is a single node, query result reuse.
with
my_cte(L_ORDERKEY) as
(select L_ORDERKEY from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM")
select * from MY_CTE limit 10000;
There are certain conditions that make Snowflake not use the result set cache. One of the more common ones is use of a function that can produce different results on multiple runs. For example, if a query includes current_timestamp(), that's going to change each time it runs.
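To illustrate (a sketch against the same sample table used above), a variant like this would not be served from the result cache on a rerun, because current_timestamp() is non-deterministic:
select L_ORDERKEY, current_timestamp() as run_ts
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM"
limit 100;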
Here is a complete list of the criteria that must all be met for the result set cache to be used. Even then, there is a note that meeting all of those criteria does not guarantee use of the result set cache.
https://docs.snowflake.com/en/user-guide/querying-persisted-results.html#retrieval-optimization

How do we define an HBase row key so that records are retrieved efficiently when there are millions of records in the table

I have 30 million records in a table, but when I try to find one of the records it takes too much time to retrieve it. Could you suggest how I should generate the row key so that records can be fetched fast?
Right now I use an auto-increment ID (1, 2, 3 and so on) as the row key. What steps do I need to take to improve performance? Let me know your concerns.
Generally, when we look at performance for a SQL structured table, we follow some basic/general tuning: apply proper indexes to the columns used in the query, apply proper logical partitioning or bucketing to the table, and give enough buffer memory for complex operations.
When it comes to big data, and especially if you are using Hadoop, the real problem is context switching between hard disk and buffer, and context switching between different servers. You need to reduce that context switching to get better performance.
Some notes:
Use the EXPLAIN feature to understand the query plan and try to improve performance.
If you use an integer row key, it will give the best performance, but design the row key/index when the table is created, because changing it later kills performance.
When creating external tables in Hive / Impala against HBase tables, map the HBase row key to a string column in Hive / Impala (see the sketch after these notes). If this is not done, the row key is not used in the query and the entire table is scanned.
Never use LIKE in a row-key query, because it scans the whole table. Use BETWEEN or =, <, >=.
If you are not using a filter against the row-key column in your query, your row-key design may be wrong. The row key should be designed to contain the information you need to find specific subsets of data.
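A minimal sketch of such a mapping (the Hive table, HBase table, column family, and column names are all hypothetical):
CREATE EXTERNAL TABLE hbase_users (
  user_key STRING,   -- maps to the HBase row key as a string
  name     STRING,
  email    STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:email")
TBLPROPERTIES ("hbase.table.name" = "users");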

Hive Bucketing - How to run hive query for specific bucket

I have a Hive query which reads 5 large tables and outputs the records to the next process. All these tables are partitioned on proc_dt and bucketed on user_id (5 buckets). Joins are done on user_id and filtering on proc_dt.
How can I run this query for a specific bucket of all the tables? For example, I want to run the query for just the first bucket of each table.
The reason for doing this is that once I complete the query for the first bucket, I can send the output data to the next process. While the next process is running, I can complete the query for the next bucket, and so on. This way the next process is not waiting for the entire query to finish.
If I had one more column containing user_id mod 5, I would have gone for partitioning, but there is no such column and I cannot add one.
Could anyone please give me a solution for this? Any suggestions would be really helpful.
I found the answer. You can specify the bucket number in the query. Check the link below for more detail.
https://www.qubole.com/blog/big-data/5-tips-for-efficient-hive-queries/
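Presumably the mechanism referred to is Hive's TABLESAMPLE bucket sampling; a minimal sketch, assuming a table bucketed on user_id into 5 buckets (the table name and proc_dt value are illustrative):
SELECT *
FROM my_table TABLESAMPLE(BUCKET 1 OUT OF 5 ON user_id)
WHERE proc_dt = '2023-01-01';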
You can specify partitions within query statements but not buckets. Buckets are used for optimization purposes, e.g. faster sampling and map-side joins, but they are not visible to SQL statements.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
So here is the documentation example:
CLUSTERED BY(user_id) INTO 256 BUCKETS;
This clearly does not permit access to individual buckets by value/name.
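For context, that clause sits inside a CREATE TABLE statement roughly like this (a sketch; the table and column names are illustrative):
CREATE TABLE page_view (
  view_time INT,
  user_id   BIGINT,
  page_url  STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 256 BUCKETS
STORED AS ORC;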

Get the top N records from two unconnected data sets

I have two Rails services that return data from distinct databases. In one data set I have records with fields that are something like this:
query, clicks, impressions
In the second I have records with fields something like this:
query, clicks, visitors
What I want to be able to do is get paged data from the merged set, matching on queries. But it also needs to include all records that exist in either one of the data sets, and then sort them by the 'clicks' column.
In SQL if these two tables were in the same database I'd do this:
SELECT COALESCE(a.query, b.query) AS query,
       a.clicks, b.clicks, a.impressions, b.visitors
FROM a FULL OUTER JOIN b ON a.query = b.query
ORDER BY GREATEST(COALESCE(a.clicks, 0), COALESCE(b.clicks, 0)) DESC
LIMIT 100 OFFSET 1
An individual "top 100" from each data set produces incorrect results because 'clicks' in data set 'a' may be significantly higher or lower than in data set 'b'.
As they aren't in the same database, I'm looking for help with the algorithm that makes this kind of query efficient and clean.
I never found a way to do this outside of a database. In the end, we just used PostgreSQL's Foreign Data Wrapper feature to connect the two databases together and use PostgreSQL for handling the sorting and paging.
One trick for anyone heading down this path: we built VIEWs on the remote server that provided exactly the data needed in 'a' above. This was thousands of times faster than trying to join tables across the remote connection, as the value of the indexes was lost.
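A rough sketch of the wiring with postgres_fdw (the server name, host, credentials, and view name are all hypothetical):
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER remote_b
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'remote-host', dbname 'service_b');
CREATE USER MAPPING FOR CURRENT_USER
  SERVER remote_b
  OPTIONS (user 'report_user', password 'secret');
-- b_view is the remote VIEW that already shapes the data as needed
IMPORT FOREIGN SCHEMA public LIMIT TO (b_view)
  FROM SERVER remote_b INTO public;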

Which is faster in Apache Pig: Split then Union or Filter and Left Join?

I am currently processing a large input table (10^7 rows) in Pig Latin where the table is filtered on some field, processed and the processed rows are returned back into the original table. When the processed rows are returned back into the original table the fields the filters are based on are changed so that in subsequent filtering the processed fields are ignored.
Is it more efficient in Apache Pig to first split the processed and unprocessed tables on the filtering criteria, apply processing and union the two tables back together or to filter the first table, apply the process to the filtered table and perform a left join back into the original table using a primary key?
I can't say which one will actually run faster, I would simply run both versions and compare execution times :)
If you go for the join-based solution (filter, process, then left join back), make sure to specify the smaller of the two tables (if there is one) first in the join operation; that will probably be the newly processed data. The Pig documentation suggests that this leads to a performance improvement because the last table is "not brought into memory but streamed through instead".
