Find the highest and lowest values of 8 calculated fields - dashboard

I have 8 calculated fields in Tableau and I need to find the two lowest of them and the highest value. How do I do that? Is that possible in Tableau?

There are actually two functions named MIN() -- the same applies to MAX(), but I'll just talk about MIN().
The more commonly used one takes a single argument and is an aggregate function, such as MIN([Sales]). It operates on multiple records and returns the minimum value found in the whole set of records. It is translated into the SQL MIN() function for relational data sources.
The other version of MIN() takes two arguments, such as MIN([x], [y]). It is a record level function and operates on a single record and returns the minimum value of its two arguments for that record. It is translated into the SQL LEAST() function for relational data sources.
You want to use the second form. Unfortunately, Tableau restricts it to two arguments, but you can simply nest multiple calls, as in MIN(MIN(a, b), c).
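For example, a record-level calculated field covering eight fields might look like this; the field names [F1] through [F8] are placeholders for your own calculated fields:

    // Record-level minimum of eight fields (hypothetical names)
    MIN(
        MIN(MIN([F1], [F2]), MIN([F3], [F4])),
        MIN(MIN([F5], [F6]), MIN([F7], [F8]))
    )

Replacing every MIN with MAX in the same nesting gives the highest value.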


FILTER vs CALCULATE DAX Function

Is there a difference between the FILTER and CALCULATE DAX functions?
From what I understand, they produce the same result and are synonymous.
Is this true?
They aren't remotely the same.
FILTER takes a table expression as its first argument and iterates through all the rows of that table checking the condition provided in the second argument. It returns a table.
CALCULATE calculates a scalar expression (first argument) in the filter context determined by the remaining arguments. It returns a single scalar value.
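To make the contrast concrete, here is a minimal sketch assuming a hypothetical Sales table with Amount and Color columns:

    -- FILTER returns a table: the rows of Sales whose Amount exceeds 100
    EVALUATE FILTER ( Sales, Sales[Amount] > 100 )

    -- CALCULATE returns a scalar: the total Amount, evaluated in a
    -- filter context modified by its second argument
    Red Sales := CALCULATE ( SUM ( Sales[Amount] ), Sales[Color] = "Red" )

In practice, FILTER is most often used as a table argument inside CALCULATE, while CALCULATE itself always yields a single value.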

A column of a CSV file in Google AutoML Tables is recognised as text or categorical instead of numeric as I would like

I tried to train a model using Google AutoML Tables, but I have the following problem.
The CSV file is correctly imported; it has 2 columns and about 1870 rows, all numeric.
The system recognises only one column as numeric, but not the other.
The problematic column has 5 digits in each row, separated by spaces.
Is there anything I should do in order for the system to properly recognise the data as numeric?
Thanks in advance for your help
The issue is with the definition of the Numeric data type: a numeric value needs to be comparable (greater than, smaller than, equal).
Two different lists of numbers are not comparable; for example, 2 4 7 is not comparable to 1 5 7. To solve this without using strings, and therefore without losing the "information" in those numbers, you have several options.
For example:
Create an array of numbers by wrapping the values of the second column in [ ]. Take into consideration the Array data type's relative weighted approach in AutoML Tables, as it may affect the "information" extracted from the sequence.
Create an additional column for every entry of the second column, so that each one holds a single number and hence is truly numeric (see the sketch below).
I would personally go for the second option.
If you are afraid of losing "information" by splitting the numbers, consider that after training the model should deduce by itself the importance of the position and other "information" those number sequences might contain (mean, norm/modulus, relative increase, ...), provided the training data is representative.
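As an illustration of the second option, here is a minimal preprocessing sketch in Python with pandas; the file and column names ("data.csv", "sequence") are hypothetical:

    import pandas as pd

    # Hypothetical names: the "sequence" column holds 5 space-separated digits per row.
    df = pd.read_csv("data.csv")
    parts = df["sequence"].str.split(" ", expand=True).astype(int)
    parts.columns = ["seq_%d" % (i + 1) for i in range(parts.shape[1])]

    # Replace the single text column with one truly numeric column per digit.
    df = pd.concat([df.drop(columns=["sequence"]), parts], axis=1)
    df.to_csv("data_split.csv", index=False)

The resulting CSV then has one genuinely numeric column per digit, which AutoML Tables should recognise as Numeric.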

Bind variables results in full table scan in Oracle

Checking the query cost on a table with 1 million records results in a full table scan, while the same query in Oracle with literal values results in a significantly lower cost.
Is this expected behaviour from Oracle?
Is there a way to tell Oracle not to scan the full table?
The query scans the full table when bind variables are used, while the cost reduces significantly with literal values (execution plans not shown).
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows, so it assumes a very high cardinality. Index reads perform very badly when retrieving large numbers of rows; full table scans are the best way of handling big slices of a table.
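For illustration, here is a sketch of the two variants; the table and column names are hypothetical, with :a1 as the starting offset and :a2 as the page size:

    -- Literal values: the optimizer knows only eleven rows are needed
    SELECT *
    FROM  (SELECT t.*, ROW_NUMBER() OVER (ORDER BY created_date DESC) rn
           FROM   t)
    WHERE rn BETWEEN 1 AND 11;

    -- Bind variables: the optimizer cannot tell how big the slice will be
    SELECT *
    FROM  (SELECT t.*, ROW_NUMBER() OVER (ORDER BY created_date DESC) rn
           FROM   t)
    WHERE rn BETWEEN :a1 + 1 AND :a1 + :a2;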
Is this expected behaviour from Oracle?
Now that you understand this, you can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
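Reusing the hypothetical query sketched above, the hint might be applied like this:

    SELECT /*+ cardinality (100) */ *
    FROM  (SELECT t.*, ROW_NUMBER() OVER (ORDER BY created_date DESC) rn
           FROM   t)
    WHERE rn BETWEEN :a1 + 1 AND :a1 + :a2;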
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).

Dynamic tuple in Pig?

I obtained data with tuples in Pig:
0,(0),(zero)
1,(1,2),(first,second)
Can I transform it into this?
0,0,zero
1,1,first
1,2,second
To start off, I'm going to correct your terminology: you should be treating (0) and (1,2) as bags, not tuples. Tuples are intended to be fixed-length data structures that represent some sort of entity, say (name, address, year of birth). If you have a list of similar objects, like {(apple), (orange), (banana)}, you want a bag.
There is no built-in behavior that allows you to "zip" up multiple bags/lists. The reason, from a design perspective, is that Pig treats bags as unordered lists, hence the term "bag", not "list". This assumption really helps with parallelism, since you don't have to be concerned with order. It is therefore really hard to match up 1 with first.
What you could try is writing an eval function UDF that takes the two bags as parameters, zips up the two lists, and returns one bag with the pairs combined.
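A sketch of what using such a UDF might look like in Pig Latin; ZipBags, the jar name, and the load schema are all hypothetical, since you would have to write the UDF yourself:

    -- Hypothetical custom UDF that zips two bags into a bag of pairs
    REGISTER myudfs.jar;
    DEFINE ZipBags com.example.ZipBags();

    recs   = LOAD 'input' AS (id:int, nums:bag{t1:(n:int)}, names:bag{t2:(s:chararray)});
    zipped = FOREACH recs GENERATE id, FLATTEN(ZipBags(nums, names));
    -- zipped now holds (0,0,zero), (1,1,first) and (1,2,second)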

What's a simple way to design a memoization system with limited memory?

I am writing a manual computation memoization system (ugh, in Matlab). The straightforward parts are easy:
A way to put data into the memoization system after performing a computation.
A way to query and get data out of the memoization.
A way to query the system for all the 'keys'.
These parts are not so much in doubt. The problem is that my computer has a finite amount of memory, so sometimes the 'put' operation will have to dump some objects out of memory. I am worried about 'cache misses', so I would like some relatively simple system for dropping the memoized objects which are not used frequently and/or are not costly to recompute. How do I design that system? The parts I can imagine it having:
A way to tell the 'put' operation how costly (relatively speaking) the computation was.
A way to optionally hint to the 'put' operation how often the computation might be needed (going forward).
The 'get' operation would want to note how often a given object is queried.
A way to tell the whole system the maximum amount of memory to use (ok, that's simple).
The real guts of it would be in the 'put' operation when you hit the memory limit and it has to cull some objects, based on their memory footprint, their costliness, and their usefulness. How do I do that?
Sorry if this is too vague, or off-topic.
I'd do it by creating a subclass of DYNAMICPROPS that uses a cell array to store the data internally. This way, you can dynamically add more data to the object.
Here's the basic design idea:
The data is stored in a cell array, with one row per property. The first column holds the property name (for convenience), the second a function handle that calculates the data, the third the data itself, the fourth the time it took to generate the data, the fifth an array of, say, length 100 storing the timestamps of the last 100 accesses, and the sixth the size of the variable.
There is a generic get method that takes as input the row number corresponding to the property (see below). The get method first checks whether column 3 is empty. If it is not empty, it returns the value and stores the timestamp. If it is empty, it performs the computation using the handle from column 2 inside a TIC/TOC statement to measure how expensive the computation is (the cost is stored in column 4, and the timestamp in column 5). Then it checks whether there is enough space for storing the result. If yes, it stores the data; otherwise it looks at the size, as well as the product of how many times the data were accessed with how long it would take to regenerate, to decide what to cull.
In addition, there is an 'add' method that adds a row to the cell array, creates a dynamic property (using ADDPROP) of the same name as the function handle, and sets its get-method to myGetMethod(myPropertyIndex). If you need to pass parameters to the function, you can create an additional property myDynamicPropertyName_parameters with a set method that removes previously calculated data whenever the parameters change value.
Finally, you can add a few dependent properties, that can tell how many properties there are (# of rows in the cell array), how they're called (first col of the cell array), etc.
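Here is a minimal sketch of that design; the class and property names are illustrative, the memory budget is hard-coded, and the culling score (cost times access count) is just one reasonable choice:

    % Minimal sketch of the design above; names and the culling policy are illustrative.
    classdef MemoCache < dynamicprops
        properties (Access = private)
            store = {}       % rows: {name, fhandle, data, cost, accessTimes, bytes}
            maxBytes = 1e9   % memory budget
        end
        methods
            function add(obj, name, fhandle)
                obj.store(end+1, :) = {name, fhandle, [], 0, [], 0};
                idx = size(obj.store, 1);
                p = obj.addprop(name);              % create the dynamic property
                p.GetMethod = @(o) o.fetch(idx);    % route gets through fetch()
            end
            function val = fetch(obj, idx)
                if isempty(obj.store{idx, 3})       % not cached: compute and time it
                    t = tic;
                    val = obj.store{idx, 2}();      % function handle from column 2
                    obj.store{idx, 4} = toc(t);     % cost into column 4
                    info = whos('val');
                    obj.makeRoom(info.bytes);
                    obj.store{idx, 3} = val;
                    obj.store{idx, 6} = info.bytes;
                else
                    val = obj.store{idx, 3};
                end
                obj.store{idx, 5}(end+1) = now;     % access timestamp into column 5
            end
            function makeRoom(obj, needed)
                % Cull cached entries that are cheap to recompute and rarely used.
                while sum([obj.store{:, 6}]) + needed > obj.maxBytes
                    scores = cellfun(@(c, t) c * numel(t), ...
                                     obj.store(:, 4), obj.store(:, 5));
                    scores(cellfun(@isempty, obj.store(:, 3))) = Inf;
                    if all(isinf(scores)), break, end
                    [~, victim] = min(scores);
                    obj.store{victim, 3} = [];      % drop the data, keep the handle
                    obj.store{victim, 6} = 0;
                end
            end
        end
    end

Usage would then look like c = MemoCache; c.add('x', @() someExpensiveComputation()); y = c.x;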
Consider using Java, since MATLAB runs on top of it, and can access it. This would work if you have marshallable values (ints, doubles, strings, matrices, but not structs or cell arrays).
LRU containers are available for Java:
Easy, simple to use LRU cache in java
http://java-planet.blogspot.com/2005/08/how-to-set-up-simple-lru-cache-using.html
http://www.codeproject.com/KB/java/lru.aspx
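If you go that route, the LinkedHashMap-based approach those links describe fits in a few lines; a minimal sketch (the class name is arbitrary):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal LRU cache in the style described by the links above.
    public class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public LruCache(int maxEntries) {
            super(16, 0.75f, true);   // accessOrder = true: iteration order tracks use
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;   // evict the least recently used entry
        }
    }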
