How to get distinct domain counts from a Frame in H2O

In H2O, when we parse a .csv file into a Frame object, how can we get the distinct-value counts of a particular column (Vec)?
For example, consider a column Fruits which contains apple 3 times and mango 2 times. After parsing into a frame, we can get the distinct values using the domain() method, but how do we get the distinct values along with their counts? In the example, I would be looking for:
apple,3
mango,2

You're looking for h2o.table.
From R:
fr <- as.h2o(iris)
h2o.table(fr[, "Species"])
From Python:
fr["Species"].table()

Related

Writing a formula in a cell in Google Sheets that averages the results from a column derived from expected values in multiple columns

I'm an average user of Google sheets and I've tried writing/looking up the formula I'm going for, but I haven't had any luck yet.
I have a spreadsheet with multiple values, and I need to display in a single cell the average of a certain set of values in one column, selected by specific values in other columns.
The flow of information would look something along the lines of:
if value in Column D=L
then
if value in Column J<$1.20
then
Find Avg of all Values in Column N
I'd need the formula to narrow its field of data each time, so the final result is the average of all the values in Column N that have a value in Column J < $1.20 and a value in Column D = L.
I feel like a dummy over here because I just can't narrow down how I should write this flow and get it to work right without adding multiple extra hidden columns. Can anyone help on this one?
I've tried writing the formula multiple different ways but haven't kept it written down to pass on.
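
For what it's worth, the flow described above maps onto a single AVERAGEIFS call, assuming the data sits in columns D, J and N as described and the dollar amounts in column J are stored as numbers:

=AVERAGEIFS(N:N, D:D, "L", J:J, "<1.2")

AVERAGEIFS narrows its averaging range once per criteria pair, so no hidden helper columns are needed.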

How to get the sum of values of a column in tmap?

I have 2 columns: Matches (Integer) and Accounts_type (String). I want to create a third column with the proportion of matches played by each account type. I am new to Talend and have been stuck on this for the past 2 days; I did a lot of research but to no avail. Please help.
You can do it like this:
You need to read your source data twice (I used tFixedFlowInput_1 and tFixedFlowInput_2 with the same data). The idea is to calculate the total of your matches in tAggregateRow_1, which simply sums all Matches without a group-by column, and then use that total as a lookup.
The tMap then joins your source data with the calculated total. Since the total will always be one record, you don't need any join column. You then simply divide Matches by Total as required.
This is supposing you have unique values in Account_type; if you don't, you need to add another tAggregateRow between your source and tMap_1, in order to get sum of Matches for each Account_type (group by Account_type).
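
As a sketch, the expression for the new proportion column in tMap_1 could look like the following (tMap expressions are plain Java; row1 and row2 are illustrative names for the main and lookup flows, and the cast avoids integer division):

((double) row1.Matches) / row2.Total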

Power Query, avg value based on the values appearing within a specified date range

Context:
I have a data set of the weights of truck and trailer combinations coming into my site over the span of a few years. I have organized my data by season, as I am trying to show that the truck:trailers in winter are noticeably heavier due to ice, snow, and mud. The theory is that if the tare weight (the weight of the truck after it empties its load) is higher in this season than its average tare weight (which I need to calculate from the data), it can be deduced that the truck:trailer combinations are coming in with extra weight that we pay for in part, as some snow/ice/mud falls off in the trailer-emptying process.
What I've done so far:
I've defined a custom date range for my seasons
I've grouped by Truck:Trailer, using Count to get a duplicates column and All Rows to keep all my details
I've filtered out every combination I've seen fewer than 50 times, as I want good representation for each truck:trailer combo so that I can better emphasize repeated patterns
I've added an index column to better keep track of the individuals before expanding the details
What I need to do:
I only want to work with truck:trailer combinations which have weighed in for all four seasons at least once
I need to find the average tare weight of the truck:trailer combinations based over the extended range for both summer and autumn (the dry time of the year) while preserving the raw tare data for all seasons, as I need to eventually compare the winter tare values to this average.
(Screenshots: an example of my data, the layout I'd like when finished, the pivot chart, and the query data.)
For your first question (all seasons) you can add a column that holds the distinct count of the values in [Season] for each [Driver:Trailer], then filter your table on that column, keeping only the 4's. To achieve this, add the following M code to your script in the Advanced Editor, and change the part after in to #"DistinctCount Season":
#"DistinctCount Season" = Table.Join(#"insert name previous step","Driver:Trailer",
Table.Group(#"insert name previous step", {"Driver:Trailer"},
{{"DistinctCountSeasons", each Table.RowCount(Table.Distinct(_,"Season")),
type number}}),"Driver:Trailer")
Insert the name of your previous step where indicated.
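
A follow-up step along these lines (the step name is illustrative) would then keep only the combinations that weighed in for all four seasons:

#"Kept All Seasons" = Table.SelectRows(#"DistinctCount Season", each [DistinctCountSeasons] = 4)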
For the second question:
You can use a matrix visual for that in your report. First create a measure:
[AverageTare] = AVERAGE('table'[Tare])
Then put [Season] on Rows and [AverageTare] on Values. You can create a group (right-click [Season] in the FIELDS pane) called [DrySeason] to combine the values for Summer and Autumn.
If that doesn't work for you, explore the AVERAGEX function.
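
As a sketch, a measure restricted to the dry seasons (the table name 'table' and the exact season labels are assumptions) could look like:

[AverageTareDry] =
CALCULATE(
    AVERAGE('table'[Tare]),
    'table'[Season] = "Summer" || 'table'[Season] = "Autumn"
)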
EDIT
In Excel you can use a PivotTable. Put [Season] on Rows and [AverageTare] on Values. Right-click a value in the PivotTable, select Value Field Settings and choose Average. Then select the Seasons you want to group, right-click and select Group.
EDIT 2
To add a column in the Power Query Editor that holds the average [Tare] for the [Season] in each row, add the following steps to your script in the Advanced Editor:
#"GroupedSeasonAvg" = Table.Group(#"Insert name previous step", {"Season"}, {{"AVG", each List.Average([Tare]), type number}}),
#"JoinOnSeason" = Table.NestedJoin(#"Insert name previous step",{"Season"},GroupedSeasonAvg,{"Season"},"AVGGrouped"),
#"ExtractSeasonAVG" = Table.ExpandTableColumn(JoinOnSeason, "AVGGrouped", {"AVG"}, {"SeasonAVG"})
It works something like this:
#"GroupedSeasonAvg": creates a table with the averages for each [Season].
#"JoinOnSeason": creates a new column of tables, joining the [Season] value in each row to [Season] in the grouped table.
#"ExtractSeasonAVG": expands each table and keeps only [AVG].

PowerBI filter table based on value of measure_A OR measure_B [duplicate]

We are trying to implement a dashboard that displays various tables, metrics and a map where the dataset is a list of customers. The primary filter condition is the disjunction of two numeric fields. We want the user to be able to select a threshold for [field 1] and a separate threshold for [field 2] and then impose the condition [field 1] >= <threshold> OR [field 2] >= <threshold>.
After that, we want to also allow various other interactive slicers so the user can restrict the data further, e.g. by country or account manager.
Power BI naturally imposes AND between all filters and doesn't have a neat way to specify OR. Can you suggest a way to define a calculation using the two numeric fields that is then applied as a filter within the same interactive dashboard screen? Alternatively, is there a way to first prompt the user for the two threshold values before the dashboard is displayed -- so when they click Submit on that parameter-setting screen they are then taken to the main dashboard screen with the disjunction already applied?
Added in response to a comment:
The data can be quite simple: no complexity there. The complexity is in getting the user interface to enable a disjunction.
Suppose the data was a list of customers with customer id, country, gender, total value of transactions in the last 12 months, and number of purchases in last 12 months. I want the end-user (with no technical skills) to specify a minimum threshold for total value (e.g. $1,000) and number of purchases (e.g. 10) and then restrict the data set to those where total value of transactions in the last 12 months > $1,000 OR number of purchases in last 12 months > 10.
After doing that, I want to allow the user to see the data set on a dashboard (e.g. with a table and a graph) and from there select other filters (e.g. gender=male, country=Australia).
The key here is to create separate parameter tables and combine conditions using a measure.
Suppose we have the following Sales table:
Customer  Value  Number
-----------------------
A           568       2
B          2451      12
C          1352       9
D           876       6
E           993      11
F          2208      20
G          1612       4
Then we'll create two new tables to use as parameters. You could do a calculated table like
Number = VALUES(Sales[Number])
Or something more complex like
Value = GENERATESERIES(0, ROUNDUP(MAX(Sales[Value]),-2), ROUNDUP(MAX(Sales[Value]),-2)/10)
Or define the table manually using Enter Data or some other way.
In any case, once you have these tables, name their columns what you want (I used MinCount and MinValue) and write your filtering measure:
Filter = IF(MAX(Sales[Number]) > MIN(Number[MinCount]) ||
MAX(Sales[Value]) > MIN('Value'[MinValue]),
1, 0)
Then put your Filter measure as a visual-level filter where Filter is not 0, and use the MinCount and MinValue columns as slicers.
If you select 10 for MinCount and 1000 for MinValue then your table should look like this:
Notice that E and G only exceed one of the thresholds and that A and D are excluded.
To my knowledge, there is no such built-in slicer feature in Power BI at the time of writing. There is, however, a suggestion in the Power BI forum that requests functionality like this. If you're willing to use the Power Query Editor, it's easy to obtain the values you're looking for, but only with hard-coded values for your limits or thresholds.
Let me show you how for a synthetic dataset that should fit the structure of your description:
Dataset:
CustomerID,Country,Gender,TransactionValue12,NPurchases12
51,USA,M,3516,1
58,USA,M,3308,12
57,USA,M,7360,19
54,USA,M,2052,6
51,USA,M,4889,5
57,USA,M,4746,6
50,USA,M,3803,3
58,USA,M,4113,24
57,USA,M,7421,17
58,USA,M,1774,24
50,USA,F,8984,5
52,USA,F,1436,22
52,USA,F,2137,9
58,USA,F,9933,25
50,Canada,F,7050,16
56,Canada,F,7202,5
54,Canada,F,2096,19
59,Canada,F,4639,9
58,Canada,F,5724,25
56,Canada,F,4885,5
57,Canada,F,6212,4
54,Canada,F,5016,16
55,Canada,F,7340,21
60,Canada,F,7883,6
55,Canada,M,5884,12
60,UK,M,2328,12
52,UK,M,7826,1
58,UK,M,2542,11
56,UK,M,9304,3
54,UK,M,3685,16
58,UK,M,6440,16
50,UK,M,2469,13
57,UK,M,7827,6
Desktop table:
Here you see an Input table and a subset table using two Slicers. If the forum suggestion gets implemented, it should hopefully be easy to change a subset like below to an "OR" scenario:
Transaction Value > 1000 OR Number of purchases > 10 using Power Query:
If you use Edit Queries > Advanced filter you can set it up like this:
The last step under Applied Steps will then contain this formula:
= Table.SelectRows(#"Changed Type2", each [NPurchases12] > 10 or [TransactionValue12] > 1000)
Now your original Input table will look like this:
Now, if only we were able to replace the hardcoded 10 and 1000 with a dynamic value, for example from a slicer, we would be fine! But no...
I know this is not what you were looking for, but it was the best 'negative answer' I could find. I guess I'm hoping for a better solution just as much as you are!

Convert Array#select to ActiveRecord query in Rails 4

I'm writing a custom search function, and I have to filter through an association.
I have 2 ActiveRecord-backed models, cards and colors, with a has_and_belongs_to_many association, and colors have an attribute color_name.
As my DB has grown to around 10k cards, my search function has become exceptionally slow, because I have a select statement with a query inside it, so essentially I'm making thousands of queries.
I need to convert the Array#select method into an ActiveRecord query that will yield the same results, and I'm having trouble coming up with a solution. The current (relevant) code is the following:
colors = [['Black'], ['Blue', 'Black']] # a parameter retrieved from a form submission
if colors
  cards = colors.flat_map do |col|
    col.inject(Card.includes(:colors)) do |memo, color|
      temp = cards.joins(:colors).where(colors: { color_name: color })
      # keep only cards whose full color list exactly matches col
      memo + temp.select { |card| card.colors.pluck(:color_name).sort == col.sort }
    end
  end
end
The functionality I'm trying to mimic is that only cards with colors exactly matching the incoming array will be selected by the search (comparing two arrays). Because cards can be mono-red, red-blue, or red-blue-green etc., I need to be able to search for only red-blue cards or only mono-red cards.
I initially started along this route, but I'm having trouble comparing arrays with an ActiveRecord query:
color_objects = Color.where(color_name: col)
Card.includes(:colors).where('colors = ?', color_objects)
returns the error
ActiveRecord::StatementInvalid: PG::SyntaxError: ERROR: syntax error
at or near "SELECT" LINE 1: ...id" WHERE "cards"."id" IN (2, 3, 4) AND
(colors = SELECT "co...
It looks to me like it's failing because it doesn't want to compare arrays, only table elements. Is this functionality even possible?
One solution might be to convert the HABTM into a has_many :through relation and make join tables which contain keys for every permutation of colors, in order to access those directly.
I need to be able to search for only green-black cards, and not have mono-green, or green-black-red cards show up.
I've deleted my previous answer, because I did not realize you are looking for an exact match.
I played a little with it and I can't see any solution without using an aggregate function.
For Postgres it will be array_agg.
You need to generate an SQL query like:
SELECT cards.*,
       array_to_string(array_agg(colors.color_name ORDER BY colors.color_name), ',') AS color_names
FROM cards
JOIN cards_colors ON cards.id = cards_colors.card_id
JOIN colors ON colors.id = cards_colors.color_id
GROUP BY cards.id
HAVING array_to_string(array_agg(colors.color_name ORDER BY colors.color_name), ',') = 'black,green'
I never used those aggregates much, so perhaps array_to_string is the wrong formatter; in any case you have to make sure the colors are aggregated in alphabetical order (hence the ORDER BY inside array_agg). As long as you don't have too many cards it will be fast enough, but note that it scans every card in the table.
If you want this query to use an index, you should denormalize your data structure: use an array of color_names on each cards record, index that array field, and search on it. You can also keep your normalized structure and define an association callback which appends the color name to the card's color_names array every time a color is assigned to a card.
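
A minimal sketch of that denormalized layout (the column and index names are illustrative):

-- store the alphabetically sorted color names directly on cards and index them
ALTER TABLE cards ADD COLUMN color_names text[];
CREATE INDEX index_cards_on_color_names ON cards (color_names);
-- an exact match is then a plain equality test that can use the index
SELECT * FROM cards WHERE color_names = ARRAY['black','green'];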
Try this:
colors = Color.where(color_name: col).pluck(:id)
Card.includes(:colors).where('colors.id'=> colors)
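
Note that this returns any card having one of those colors, not the exact combination. If you need the exact match in ActiveRecord terms, a hedged translation of the aggregate approach above (Postgres-specific, names as in the question) might be:

# cards whose complete, alphabetized color list equals col, e.g. ['Black', 'Blue']
Card.joins(:colors)
    .group('cards.id')
    .having('array_agg(colors.color_name ORDER BY colors.color_name) = ARRAY[?]::varchar[]', col.sort)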
