percentile_approx in hive returning zero - hadoop

I have been trying to compute percentile_approx for a set of users. The intention is to get the top 25% of customers in the data set, so I ran the following Hive query.
select percentile_approx(amount, 0.75)
from sales
However, the value returned from this query is 0.0, and I am not sure what the problem is. When I run the same query over a sample of a few records, the result is what I expect.
Can anyone please shed some light on this?
Note - I am trying to find the percentile in a data set containing more than 3.3 M records.

select percentile_approx(cast(amount as double), ARRAY(0.75))
from sales
Try this method

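Note that percentile_approx() works on any numeric column, including floating-point data; it is the exact percentile() function that requires integers. Please make sure the column you apply it to is actually numeric (cast it to double if it is stored as a string).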
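One more thing worth checking is whether 0.0 is simply the correct answer for the full data set. The following is a minimal sketch in Python, not Hive's actual algorithm: it uses a plain nearest-rank percentile, and the sample amounts are made up for illustration. It shows that when at least 75% of the rows carry an amount of 0, the 0.75 percentile is legitimately 0.0, even though a small sample can give a different value.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least a
    fraction p of the data at or below it."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


# Hypothetical data: most customers have amount 0, a few have large amounts.
amounts = [0.0] * 8 + [120.5, 990.0]

threshold = percentile(amounts, 0.75)
# Customers strictly above the threshold form the top group.
top_customers = [a for a in amounts if a > threshold]

print(threshold)
print(top_customers)
```

If this is what is happening in the sales table, filtering out zero or null amounts before taking the percentile may be closer to what you want.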

Related

How to implement percentile in Hive?

Can anyone please tell me how to implement percentile in Hive?
I tried the percentile function, but was not able to get the expected result.
Example code will greatly help.
Use the percentile function, as per the product documentation:
Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
If you are not able to get the 'expected result', then you are going to need to add a lot more detail to your question: what the data is, your query, and the expected result.

Distinct count a field that has been sorted by territory from another source

I am trying to find a way to get a distinct count on a field that is being filtered by a territory without using grouping because of the fact that I need to then pass this value over to another report. The easiest way would be something like this:
distinctcount({Comm_Link.CmLi_Comm_CompanyId}) if {Company.Comp_Territory}='Atlanta'
But for obvious reasons that won't work. Any thoughts?
What you need is a running total. Right-click {Comm_Link.CmLi_Comm_CompanyId} and insert a running total; set the type of summary to distinct count and, under Evaluate, choose "Use a formula" and type your condition {Company.Comp_Territory}="Atlanta"
Your formula and approach are wrong. I doubt whether your formula compiled without any errors.
First create the value, and then find the distinct count:
if {Company.Comp_Territory}='Atlanta'
Then {Comm_Link.CmLi_Comm_CompanyId}
Now write the following in the footer, or get it by right-clicking the field:
distinctcount({Comm_Link.CmLi_Comm_CompanyId})
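For readers who want to check the number outside the report, here is a rough sketch of the same logic in Python. It is not Crystal Reports syntax; the rows and the helper name distinct_count_where are made up for illustration. The idea is the same as above: apply the territory condition first, then count distinct company IDs among the rows that pass.

```python
# Hypothetical (company_id, territory) rows standing in for the joined
# Comm_Link and Company data.
rows = [
    ("C001", "Atlanta"),
    ("C001", "Atlanta"),  # duplicate company, counted once
    ("C002", "Atlanta"),
    ("C003", "Boston"),   # other territory, ignored
]


def distinct_count_where(rows, territory):
    """Count distinct company IDs among rows in the given territory."""
    return len({company_id for company_id, terr in rows if terr == territory})


atlanta_count = distinct_count_where(rows, "Atlanta")
print(atlanta_count)
```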

rethinkdb: "RqlRuntimeError: Array over size limit" even when using limit()

I'm trying to access a constant number of the latest documents of a table ordered by a "date" key. Note that the date, unfortunately, was implemented (not by me...) such that the value is set as a string, e.g "2014-01-14", or sometimes "2014-01-14 22:22:22". I'm getting a "RqlRuntimeError: Array over size limit 102173" error message when using the following query:
r.db('awesome_db').table("main").orderBy(r.desc("date"))
I tried to overcome this problem by specifying a constant limit, since for now I only need the latest 50:
r.db('awesome_db').table("main").orderBy(r.desc("date")).limit(50)
Which ended with the same error. So, my questions are:
How can I get a constant number of the latest documents by date?
Is ordering by a string-based date field possible? Does this issue have something to do with my first question?
The reason you get an error here is that the orderBy gets evaluated before the limit, so it orders the entire table in memory, which exceeds the array limit. The way to fix this is by using an index. Try doing the following:
table.indexCreate("date")
table.indexWait()
table.orderBy({index: r.desc("date")}).limit(50)
That should be equivalent to what you have there but uses an index so it doesn't require loading the entire table into memory.
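On the second question, ordering by a string date can work as long as the strings are in year-month-day form. The sketch below is plain Python, not RethinkDB code, and it assumes the database compares strings lexicographically the way Python does; the sample dates are made up. It shows that such strings sort in chronological order, that a date-only value sorts just before the same date with a time attached, and that the latest N items can be picked without sorting everything.

```python
import heapq

# Hypothetical "date" values in the two formats mentioned in the question.
dates = [
    "2014-01-14 22:22:22",
    "2013-12-31",
    "2014-01-14",
    "2014-02-01 08:00:00",
]

# Lexicographic order matches chronological order for this format.
chronological = sorted(dates)

# Take the latest 2 entries without fully sorting the list.
latest = heapq.nlargest(2, dates)

print(chronological)
print(latest)
```

This only holds while every value keeps the zero-padded year-month-day layout; a value such as "14/01/2014" would break the ordering.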
Another option is to raise the array limit through the run options.
// This code is for Go
ro := r.RunOpts{ArrayLimit: 500000}
r.DB("wrk").Table("log").Run(sessionArray[0], ro)
# This code is for Python
r.db('awesome_db').table("main").run(session, array_limit=500000)

Not getting the correct totals using Cognos Report Studio. Need to get totals that show up in column

newparts_calc
if (([MonthToDateQuery].[G/L Account] = 4200 and [Query1].[G_L_Group] = 'NEW')) THEN ([Credit Amount]-[Debit Amount]) ELSE (0)
Data Item1
total([newparts_calc])
I need Data Item1 to return newparts_calc values only.
For example, in the 1st row Data Item1 should be 8,540.8, but it is 34,163.2.
What's wrong? How do I fix it?
REVISED QUESTION
I apologize for not making sense on the original question.
I have many calcs that I'm trying to gather and put on a crosstab. I want to see sales by month (row) and part category (column).
[Query2] is the one shown in the picture above.
It joins [MonthToDateQuery] and [Query1].
The join is on 'Invoice' and the cardinality is 1..1 = 1..1.
[MonthToDateQuery] is based on the package I'm working in, General Ledger. It supplies the G/L entries for each sales G/L account.
[Query1] is a SQL query I brought in to be able to break out categories even further by G/L group.
For example, G/L account 4300 is Rebuilt. However, I needed to break it out even further to see Rebuilt-Production and Rebuilt-New. I can do that with the G/L group.
I saw in my G/L account ledger entries that they reference the invoice number, so that's how I tied in my SQL.
As you can see from the table below (the tabular data view of the query), I need a total. I have tried plugging newparts_calc into my crosstab and setting the aggregation to total, but the numbers still don't seem right. I don't think I have something set as it should be.
All the calcs I'm doing are based on single or multiple G/L accounts and single or multiple G/L groups.
Any advice?
As you can see, the problem seems to be duplicate invoice numbers.
How can I fix this?
A couple of things come to mind:
- Set the processing order to 2.
- Since your calc is always a multiple and you are joining two queries, you may need to check your cardinality. Sometimes it helps to add derived queries to ensure you are working with the correct grain.
I'm obviously missing something, but if you want
"I need Data Item1 to return newparts_calc values only."
why not just use newparts_calc, without total? That would give you the proper value for row 1.
If you need a running total over days (the sum of values for previous days), you should use the running-total function.
At a guess, one of your two queries is returning multiple rows for each invoice, which will cause this double counting. Look at the output of the two queries and see if that's happening. If so, then you just need to work out how to collapse that down to one row per invoice.
Per your new question: the underlying data has got to be causing the issue. It's clearly not 1:1 (note that even though this is your stated cardinality, Cognos does not enforce 1:1). Invoice number is not unique, and G/L Group is at a lower level.
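To make the double counting concrete, here is a small sketch in Python rather than Cognos. The rows, the invoice number INV-1 and the helper join_on_invoice are all made up for illustration; the only figures taken from the question are 8,540.8 and 34,163.2, and the second is exactly four times the first. The sketch assumes the second query returns four rows for one invoice, joins on invoice, and shows the total inflating; collapsing to one row per invoice restores the expected value.

```python
from decimal import Decimal

# Hypothetical data: one ledger entry, and a second query that returns
# four rows for the same invoice.
ledger = [{"invoice": "INV-1", "amount": Decimal("8540.8")}]
groups = [{"invoice": "INV-1", "group": "NEW"} for _ in range(4)]


def join_on_invoice(left, right):
    """Inner join on the invoice key; every matching pair becomes a row."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l["invoice"] == r["invoice"]
    ]


# Each ledger row is repeated once per matching group row.
inflated = sum(row["amount"] for row in join_on_invoice(ledger, groups))

# Collapse the second query to one row per invoice before joining.
unique_groups = list({g["invoice"]: g for g in groups}.values())
correct = sum(row["amount"] for row in join_on_invoice(ledger, unique_groups))

print(inflated)
print(correct)
```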

Oracle spool Number rounding

I am calculating the sum of all sales orders (by multiplying the quantity and price of each sales order, assuming one sales order has only one item, and using the sum function) in a SQL query, and I am spooling the output to a CSV file using spool C:\scripts\output.csv.
The numeric output I get is truncated/rounded, e.g. the SQL output 122393446 appears in the CSV as 122400000.
I tried searching Google and Stack Overflow, but I could not find any hints about how to prevent this.
Any clues?
Thanks
I think it is an xls issue.
Save the file as xls.
Then format the column as a number, with 2 decimals for example.
Initially I thought it might have something to do with the width of the number format, which is normally 10 (NUMWIDTH) in sqlplus, but your result's numeric width is 9, so that cannot be the problem. Please check whether your query uses a numeric type that doesn't have the required precision and thus makes inexact calculations.
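One observation that may help narrow it down: 122400000 is what you get when 122393446 is rounded to four significant digits, which looks more like a display format than a calculation error. The sketch below is plain Python, not SQL*Plus or a spreadsheet, and where the rounding actually happens in your setup is an assumption to verify. It only shows that writing the value in scientific notation with three decimals and reading it back reproduces the number from the question.

```python
value = 122393446

# Scientific notation with 3 decimals keeps 4 significant digits.
shown = f"{value:.3E}"

# Reading the displayed text back loses the remaining digits.
round_trip = int(float(shown))

print(shown)
print(round_trip)
```

If the spooled file itself contains the full digits when opened in a plain text editor, the rounding is happening in the program used to view the CSV.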
