Data skew caused by joining on different data types - performance

Today I read an article about Hive tuning. One paragraph is as follows:
Scenario: the user_id field in the user table is of type int, while the user_id field in the log table contains both string and int values. When the two tables are joined on user_id, the default hash operation partitions on the id as an int, and this will cause all records of the string type id assigned to a reducer.
Solution: convert the numeric type to a string type:
select * from users a
left outer join logs b
on a.user_id = cast(b.user_id as string)
Can anybody give me some more explanation of the above? I really cannot understand what the author is describing. Why does "this will cause all records of the string type id assigned to a reducer" happen? Thanks in advance!

For starters, you did not copy-and-paste / transcribe the original properly. Here is the more likely wording:
this will cause all records of the string type id assigned to a
single reducer.
The reason that would happen is that, without the cast, the conversion of the string ids to int is probably turning them all into 0. The hashing will therefore put all of those ids into the same partition for the 0 value, and a single reducer ends up with all of them.
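A minimal sketch of the effect (exact behavior varies by Hive version; non-numeric strings typically become NULL rather than 0 when coerced to int, but the outcome is the same: they all share one join key):
-- every id that is not a valid number collapses to the same value when coerced to int,
-- so all of those rows hash to the same partition and land on one reducer
SELECT cast('a1b2c3' AS INT);   -- NULL (0 in some versions)
SELECT cast('7382771' AS INT);  -- 7382771
-- casting the int side to string instead keeps each id distinct for the hash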

Convert column value from null to value of similar row with similar values

Sorry for the slightly strange title; I couldn't think of a succinct way to describe my problem.
I have a set of data that is created by one person, the data is structured as follows
ClientID ShortName WarehouseZone RevenueStream Name Budget Period
This data is manually inputted, but as there are many Clients and Many RevenueStreams only lines where budget != 0 have been included.
This needs to connect to another data set to generate revenue and there are times when revenue exists, but no budget exists.
For this reason I have gathered all customers and cross joined them to all codes and then appended these values into the main query; however, as WarehouseZone is manually inputted, there are a lot of entries where WarehouseZone is null.
This will always be the same for every instance of the customer.
Now, after my convoluted explanation, here is my question: how can I do the following?
Pseudo code that I hope makes sense:
SET WarehouseZone = WarehouseZone WHERE ClientID = ClientID AND
WarehouseZone != NULL
Are you sure that a client has only one WarehouseZone? Otherwise you need an aggregation.
Let's check: you can add a custom column that will return a record like this:
// capture the current row's ClientID before the inner each rebinds _
let CurrentClient = [ClientID] in
Table.Max(
    Table.SelectColumns(
        Table.SelectRows(#"Last Step",
            each [ClientID] = CurrentClient),
        "WarehouseZone"),
    "WarehouseZone"
)
This will create a new column holding, for each row, the record with the maximum WarehouseZone for that ClientID. At the end you can expand the record to get the value.
P.S.: the calculation is not great for performance.
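If performance is a concern, here is a sketch of a usually faster alternative (step and column names such as #"Last Step", ZonePerClient and ZoneFill are assumptions based on this question): pre-aggregate the non-null zone once per client, then merge that small lookup back in and expand it.
// sketch only - paste into the Advanced Editor and adapt the step names
let
    // keep only rows where the zone was actually entered
    NonNullZones = Table.SelectRows(#"Last Step", each [WarehouseZone] <> null),
    // one row per client with its (max) non-null zone
    ZonePerClient = Table.Group(NonNullZones, {"ClientID"},
        {{"ZoneFill", each List.Max([WarehouseZone]), type text}}),
    // merge the lookup back onto the full table and expand the zone value
    Merged = Table.NestedJoin(#"Last Step", {"ClientID"}, ZonePerClient, {"ClientID"}, "Lookup", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "Lookup", {"ZoneFill"})
in
    Expanded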

How can I group, sum and count with Sequel ORM and PostgreSQL?

This is too tough for me, guys. It's for Jeremy!
I have two tables (although I can also envision needing to join a third) and I want to sum one field and count rows in the same table, while joining with another table, and return the result in JSON format.
First of all, the data type of the field that needs to be summed is numeric(10,2), and the data is inserted as params['amount'].to_f.
The tables are expense_projects, which has the name of the project and the company id, and expense_items, which has the company_id, item and amount (to mention just the critical columns) - the "company_id" columns are disambiguated.
So, the following code:
expense_items = DB[:expense_projects].left_join(:expense_items, :expense_project_id => :project_id).where(:project_company_id => company_id).to_a.to_json
works fine but when I add
expense_total = expense_items.sum(:amount).to_f.to_json
I get an error message which says
TypeError - no implicit conversion of Symbol into Integer:
So, the first question is: why does this happen, and how can it be fixed?
Then I want to join the two tables and get all the project names from the left (first) table and sum amount and count items in the second table. I have tried
DB[:expense_projects].left_join(:expense_items, :expense_items_company_id => expense_projects_company_id).count(:item).sum(:amount).to_json
and variations of this, all of which fail.
I would like a result which gets all the project names (even if there are no expense entries) and returns something like:
project   item_count   item_amount
pr 1      7            34.87
pr 2      0            0
and so on. How can this be achieved with one query returning the result in json format?
Many thanks, guys.
Figured it out, I hope this helps somebody else:
DB[:expense_projects___p].where(:project_company_id=>user_company_id).
left_join(:expense_items___i, :expense_project_id=>:project_id).
select_group(:p__project_name).
select_more{count(:i__item_id)}.
select_more{sum(:i__amount)}.to_a.to_json
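One possible refinement (a sketch, untested, keeping the same old-style Sequel symbol aliases as above): projects with no expense items come back with a NULL sum, so coalesce can turn that into 0, and naming the aggregates gives cleaner JSON keys.
# hypothetical variant of the query above
DB[:expense_projects___p].where(:project_company_id => user_company_id).
  left_join(:expense_items___i, :expense_project_id => :project_id).
  select_group(:p__project_name).
  select_more{count(:i__item_id).as(:item_count)}.
  select_more{coalesce(sum(:i__amount), 0).as(:item_amount)}.
  to_a.to_json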

I would like to know whether a label column accepts sub-zero values or empty values

I would like to apply numeric sort order to the attribute representing members' age, but the data includes sub-zero (negative) values and empty values in addition to natural human ages.
I would like to know whether the label column accepts such sub-zero or empty values, which inevitably flow in from manually input source data such as logs.
Yes!
You have to change the data type of the label from VARCHAR(128) to INT.
There are two ways to do it:
Run MAQL: "ALTER DATATYPE {f_dataset_name.nm_label_name} INT;"
Go to the CloudConnect LDM modeler: click on Dataset => Edit => Show
Data Types => change the data type on the label to Integer.
This data type also accepts sub-zero values. For "null" or "empty" values there has to be the upper-case string "NULL" in the source data.
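A hypothetical source-file fragment illustrating the rule (the column names are made up): negative ages load as ordinary integers, and missing ages must be written literally as NULL.
member_id,age
101,34
102,-1
103,NULL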

How do I use the Hive "test in(val1, val2)" built in function?

The Programming Hive book lists a "test in" built-in function in Hive, but it is not obvious how to use it, and I've been unable to find examples.
Here is the information from Programming Hive:
Return type Signature Description
----------- --------- -----------
BOOLEAN test in(val1, val2, …) Returns true if test equals one of the values in the list.
I want to know if it can be used to say whether a value is in a Hive array.
For example if I do the query:
hive > select id, mcn from patients limit 2;
id mcn
68900015 ["7382771"]
68900016 ["8847332","60015163","63605102","63251683"]
I'd like to be able to test whether one of those numbers, say "60015163" is in the mcn list for a given patient.
Not sure how to do it.
I've tried a number of variations, all of which fail to parse. Here are two examples that don't work:
select id, test in (mcn, "60015163") from patients where id = '68900016';
select id, mcn from patients where id = '68900016' and test mcn in('60015163');
The function is not test in but simply in. In Table 6-5, test is a column name.
So, in order to know whether a value is in a Hive array, you first need to use explode on your array, as in the sketch below.
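A minimal sketch of that explode approach, using LATERAL VIEW with the table and column names from the question (minor syntax adjustments may be needed for your Hive version):
-- emit one row per element of the mcn array, then filter on the element
SELECT id, mcn
FROM patients
LATERAL VIEW explode(mcn) m AS mcn_value
WHERE id = '68900016'
  AND mcn_value = '60015163';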
Instead of exploding the array column, you can create a UDF, as explained here: http://souravgulati.webs.com/apps/forums/topics/show/8863080-hive-accessing-hive-array-custom-udf-

Pig: What is the correct syntax to flatten a nested bag (2-levels deep)

I'm loading this data:
data6 = 'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
Using this command
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
I'm attempting to flatten the whole thing with this command:
A_flattened = FOREACH A GENERATE item, d, things::thing AS thing, things::d1 AS d1, FLATTEN(things::values) AS value;
But I just get this error:
Invalid field projection. Projected field [things::thing] does not exist in schema: item:chararray,d:int,things:bag{:tuple(thing:chararray,d1:int,values:bag{:tuple(v:chararray)})}
I tried naming the inner things tuple, but I get a similar error. Can someone help me with the right syntax here?
You need to use things.thing, things.d1, things.values, as you want to do a projection on a bag. The # operator is used for projection on a map.
Here is an introduction to bag projection (search for "Bag projection" on that page): http://ofps.oreilly.com/titles/9781449302641/intro_pig_latin.html
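For the fully flattened output the question is after, one possible shape is a two-step FOREACH (a sketch, assuming the load statement from the question):
-- flatten the outer bag first; its fields can then be referenced by their
-- simple names (thing, d1, values) because nothing else in the record uses them
B = FOREACH A GENERATE item, d, FLATTEN(things);
A_flattened = FOREACH B GENERATE item, d, thing, d1, FLATTEN(values) AS value;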
:: is used to avoid name conflicts when you join inputs that have the same field names. join preserves the names of the fields of the inputs passed to it, and prepends the name of the relation the field came from, followed by ::. For example,
-- join2key.pig
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date), divs by (symbol, date);
The description of jnd is:
jnd: {daily::exchange: bytearray,daily::symbol: bytearray,daily::date: bytearray,
daily::open: bytearray,daily::high: bytearray,daily::low: bytearray,
daily::close: bytearray,daily::volume: bytearray,daily::adj_close: bytearray,
divs::exchange: bytearray,divs::symbol: bytearray,divs::date: bytearray,
divs::dividends: bytearray}
The daily:: prefix does not need to be used unless the field name is no longer unique in the record. In this example, you will need to use daily::date or divs::date if you wish to refer to one of the date fields after the join. But for fields such as open or dividends you do not, because there is no ambiguity.
