How can I download all observations from Hue/Hive output? - hadoop

I am struggling with the following problem. My output table after executing a query in Hue/Hive has 1.2 million observations. When I try to download the results as a .csv file, there is only the possibility to download the first 1 million observations. I know that I can execute a query selecting the first 0.9 million observations, download the results, then execute a query to extract the last 0.3 million observations, download those results, and merge them in, for example, the R statistical package. But maybe someone knows how to do it in a single step?

You could bump the limit to more than 1 million, but beware it might slow down Hue: https://github.com/cloudera/hue/blob/master/desktop/conf.dist/hue.ini#L741
An alternative is to do a CREATE TABLE AS SELECT ... (this will scale but won't be CSV by default)
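A minimal sketch of that approach (table and column names are placeholders); adding a ROW FORMAT clause makes the underlying files comma-delimited text rather than Hive's default format:
CREATE TABLE query_results
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS
SELECT col1, col2, col3
FROM source_table;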

The easy solution for this would be to save the output in an HDFS directory and then download the data from there. Use a query like this to store the results:
insert overwrite directory "$path" select * from ...
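In newer Hive versions the directory insert can also be given a delimiter so the files come out as comma-separated text (path and names below are placeholders):
-- write the result set to an HDFS directory as comma-delimited text
INSERT OVERWRITE DIRECTORY '/user/myuser/query_output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT col1, col2, col3
FROM source_table;
The resulting part files can then be downloaded from the Hue File Browser or with hadoop fs -get.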

Related

Spoon runs slowly from Postgres to Oracle

I have a Spoon ETL that reads a table from Postgres and writes into Oracle.
No transformation, no sort. SELECT col1, col2, ... col33 from table.
350,000 rows in input. The performance is 40-50 rec/sec.
If I read/write the same table from Postgres to Postgres with ALL columns (col1...col100), I get 4-5,000 rec/sec.
The same if I read/write from Oracle to Oracle: 4-5,000 rec/sec.
So, for me, it is not a network problem.
If I try with another Postgres table and only 7 columns, the performance is good.
Thanks for the help.
The same happened in my case: while loading data from Oracle and running it on my local machine (Windows) the processing rate was 40 rows/sec, but it was 3,000 rows/sec for a Vertica database.
I couldn't figure out what the exact problem was, but I found a way to increase the processing rate. It worked for me; you can do the same.
Right-click on the Table Input step and you will see "Change Number Of Copies to Start".
Include the condition below in the WHERE clause to avoid duplicates: when you choose "Change Number Of Copies to Start", the query is triggered N times and would otherwise return duplicate records, but keeping the code below in the WHERE clause makes each copy read only its own distinct slice of records.
where ora_hash(v_account_number,10)=${internal.step.copynr}
v_account_number is the primary key in my case.
The 10 means: if, for example, you have chosen 11 copies to start, then 11 - 1 = 10, so it is up to you to set it.
Please note that this works, but I suggest using it on a local machine for testing purposes; on the server you definitely will not face this issue, so comment out the line when deploying to servers.
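Putting it together, the Table Input query would look roughly like this sketch (table and column names are placeholders; with 11 copies configured, ora_hash(x, 10) yields buckets 0 through 10, one per step copy):
-- each step copy reads only the rows whose hash bucket matches its copy number
SELECT col1, col2, col3
FROM source_table
WHERE ora_hash(v_account_number, 10) = ${internal.step.copynr}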

Hive Length outputs more than seen

I am trying to run a Hive query which should join two tables on matching records. However, it never matches, even though I have the record in the other table. When I take the length of a given string it outputs 27, but it should be just 12.
When I download the output file from S3, I see a weird row like
U S 3 F F 1 2 1 4 9 3 3
but in the Hive console it appears as
US3FF1214933
Also, I cannot query the row with
select * from table where item like "US3FF1214933";
It is a total mess right now, and trimming also does not work for me.
I am in need of help.
Thanks in advance.
Thanks to legato for giving me the idea to investigate this by running
od -c and seeing the actual characters in the string.
After that, using regexp_replace(ExString,'\0',"") in the Hive query to replace the weird characters with an empty string solved my issue.
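For reference, a sketch of how the cleaned-up join can look (table and column names are placeholders); the comparison strips the embedded '\0' characters that were inflating the length:
-- join on the value with the '\0' characters removed
SELECT a.*
FROM s3_table a
JOIN other_table b
  ON regexp_replace(a.item, '\0', '') = b.item;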

Sequence Number UDF in Hive

I have tried this UDF in Hive: UDFRowSequence.
But it is not generating unique values, i.e. it repeats the sequence depending on the mappers.
Suppose I have one file (having 4 records) available on HDFS. One mapper will be created for this job and the result will be like
1
2
3
4
but when there are multiple (large) files at the HDFS location, multiple mappers will be created for the job and each mapper will generate a repeating sequence like below
1
2
3
4
1
2
3
4
1
2
.
Is there any solution for this so that a unique number is generated for each record?
I think you are looking for ROW_NUMBER(). You can read about it and other "windowing" functions here.
Example:
SELECT *, ROW_NUMBER() OVER ()
FROM some_database.some_table
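If the numbers need to be assigned in a particular order, the window can also take an ORDER BY (the column name below is a placeholder):
SELECT t.*, ROW_NUMBER() OVER (ORDER BY some_key) AS seq
FROM some_database.some_table t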
@GoBrewers14: Yes, I did try that. We tried to use the ROW_NUMBER function, and when we query it on small data, e.g. a file containing 500 rows, it works perfectly. But when it comes to large data, the query runs for a couple of hours and finally fails to generate output.
I have come to know the following about this:
generating a sequential order in a distributed processing query is not possible with simple UDFs, because the approach would require some centralised entity to keep track of the counter, which would also be severely inefficient for distributed queries, so it is not recommended.
If you want to work with multiple mappers and a large dataset, try this UDF: https://github.com/manojkumarvohra/hive-hilo
It makes use of ZooKeeper as a central repository to maintain the state of the sequence.
A query to generate sequences. We can use this as a surrogate key in a dimension table as well.
WITH TEMP AS
  (SELECT if(max(seq) IS NULL, 0, max(seq)) max_seq
   FROM seq_test)
SELECT col_id,
       col_val,
       row_number() over() + max_seq AS seq
FROM source_table
INNER JOIN TEMP ON 1 = 1;
seq_test: your target table.
source_table: your source table.
seq: the surrogate key / sequence number / key column.

Forcing a reduce phase or a second map reduce job in hive

I am running a hive query of the following form:
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT /*+ MAPJOIN(...) */ * FROM ...
Because of the MAPJOIN, the result does not require a reduce phase. The map phase uses about 5000 mappers, and it ends up taking about 50 minutes to complete the job. It turns out that most of this time is spent copying those 5000 files to the local directory.
To try to optimize this, I replaced SELECT * ... with SELECT DISTINCT * ... (I know in advance that my results are already distinct, so this doesn't actually change my result), in order to force a second map reduce job. The first map reduce job is the same as before, with 5000 mappers and 0 reducers. The second map reduce job now has 5000 mappers and 3 reducers. With this change, there are now only 3 files to be copied, rather than 5000, and the query now only takes a total of about 20 minutes.
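In other words, the current kludge looks roughly like this (path, hint and join are placeholders):
INSERT OVERWRITE LOCAL DIRECTORY '/local/output/dir'
SELECT DISTINCT /*+ MAPJOIN(small_tbl) */ *
FROM big_tbl JOIN small_tbl ON (big_tbl.id = small_tbl.id);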
Since I don't actually need the DISTINCT, I'd like to know whether my query can be optimized in a less kludgey way, without using DISTINCT.
What about wrapping your query in another SELECT, and maybe a useless WHERE clause, to make sure it kicks off a job?
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT *
FROM (
SELECT /*+ MAPJOIN(...) */ *
FROM ..
) x
WHERE 1 = 1
I'll run this when I get a chance tomorrow and delete this part of the answer if it doesn't work. If you get to it before me then great.
Another option would be to take advantage of the virtual columns for file name and line number to force distinct results. This complicates the query and introduces two meaningless columns, but has the advantage that you no longer have to know in advance that your results will be distinct. If you can't abide the useless columns, wrap it in another SELECT to remove them.
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT {{enumerate every column except the virtual columns}}
FROM (
SELECT DISTINCT /*+ MAPJOIN(...) */ *, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
FROM ..
) x
Both solutions are more kludge-y than what you came up with, but have the advantage that you are not limited to queries with distinct results.
We get another option if you aren't limited to Hive. You could get rid of the LOCAL and write the results to HDFS, which should be fast even with 5000 mappers. Then use hadoop fs -getmerge /result/dir/on/hdfs/ to pull the results into the local filesystem. This unfortunately reaches out of Hive, but maybe setting up a two-step Oozie workflow is acceptable for your use case.
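A sketch of that two-step approach (paths are placeholders):
-- step 1: write the results to HDFS instead of the local filesystem
INSERT OVERWRITE DIRECTORY '/result/dir/on/hdfs'
SELECT /*+ MAPJOIN(...) */ * FROM ...;
-- step 2: from the shell, merge the part files into a single local file
-- hadoop fs -getmerge /result/dir/on/hdfs local_results.txt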

How to reduce cost on select statement?

I have a table in Oracle 10g with around 51 columns and 25 million records in it. When I execute a simple select query on the table to extract 3 columns, the reported cost is very high, around 182k. So I need to reduce the cost. Is there any possible way to do so?
Query:
select a,b,c
from X
a - char
b - varchar2
c - varchar2
TIA
In cases like this it's difficult to give good advice without knowing why you would need to query 25 million records. As @Ryan says, normally you'd have a WHERE clause; or perhaps you're extracting the results into another table or something?
A covering index (i.e. over a,b,c) would probably be the only way to make any difference to the performance - the query could then do a fast full index scan, and would get many more records per block retrieved.
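For example (the index name is a placeholder):
CREATE INDEX x_abc_ix ON X (a, b, c);
Note that the optimizer can only substitute the index for the table when it knows every row is present in the index, e.g. when at least one of a, b, c is declared NOT NULL.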
Well...if you know you only need a subset of those values, throwing a WHERE clause on there would obviously help out quite a bit. If you truly need all 25 million records, and the table is properly indexed, then I'd say there's really not much you can do.
Yes, it is better to state the purpose of the select, as Jeffrey Kemp said.
For a normal select, you just need to index the fields you query most, gather table statistics on the index (DBMS_STATS.GATHER_TABLE_STATS), and check the statistics of each field to be sure your index is right (read: http://bit.ly/qR12Ul).
If you need to load into another table, use a cursor, limit the number of records fetched in each execution, and load into the table via a bulk insert (the FORALL technique), as sketched below.
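A minimal PL/SQL sketch of that pattern, assuming a target table named target_tab with matching columns (all names and the batch size are placeholders):
DECLARE
  CURSOR src_cur IS
    SELECT a, b, c FROM X;            -- the source query from the question
  TYPE row_tab_t IS TABLE OF src_cur%ROWTYPE;
  l_rows row_tab_t;
BEGIN
  OPEN src_cur;
  LOOP
    -- fetch in bounded chunks to keep memory usage under control
    FETCH src_cur BULK COLLECT INTO l_rows LIMIT 10000;
    EXIT WHEN l_rows.COUNT = 0;
    -- bulk-insert the chunk into the target table in one round trip
    FORALL i IN 1 .. l_rows.COUNT
      INSERT INTO target_tab VALUES l_rows(i);
    COMMIT;
  END LOOP;
  CLOSE src_cur;
END;
/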
