How to slice a dataset in Python at specific intervals

I have a dataset with n rows. How can I access a specific number of rows every specific number of rows throughout the whole dataset using Python?
For example, in a 100-row dataset I want to access 10 rows every 10 rows, like 1:10, 20:30, 40:50, 60:70, 80:90.

I could think of something like this
df.iloc[np.array([int(x/10) for x in df.index]) % 2 == 0]
It takes the index of the dataframe, divides it by 10 and casts it to an int. This basically just removes the last digit in this example.
With the modulo operation the first 10 rows are True, the next 10 False, and so on. This is then used with iloc to get just the rows with the value True.
This requires a continuously increasing index. If, for example, some rows were already filtered out, this is not the case; reset_index can be used to reset the index.
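Put together, a minimal self-contained sketch of the idea (the example DataFrame here is made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(100)})
# If rows were filtered out earlier, restore a 0..n-1 index first.
df = df.reset_index(drop=True)
# Keep rows whose index falls in an "even" block of 10: 0-9, 20-29, 40-49, ...
mask = (np.asarray(df.index) // 10) % 2 == 0
print(df.iloc[mask])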

Related

How to filter a ClickHouse table by array column contents?

I have a ClickHouse table that has one Array(UInt16) column. I want to be able to filter results from this table to only get rows where the values in the array column are above a threshold value. I've been trying to achieve this using some of the array functions (arrayFilter and arrayExists), but I'm not familiar enough with the SQL/ClickHouse query syntax to get this working.
I've created the table using:
CREATE TABLE IF NOT EXISTS ArrayTest (
    date Date,
    sessionSecond UInt16,
    distance Array(UInt16)
) Engine = MergeTree(date, (date, sessionSecond), 8192);
Where the distance values will be distances from a certain point a certain number of seconds (sessionSecond) after the date. I've added some sample values to the table.
Now I want to get all rows which contain distances greater than 7. I found the array functions documentation here and tried the arrayExists function, but it's not working how I'd expect. The documentation says that this function "Returns 1 if there is at least one element in 'arr' for which 'func' returns something other than 0. Otherwise, it returns 0". But when I run the query below I get three zeros returned where I should get a 0 and two 1s:
SELECT arrayExists(
    val -> val > 7,
    arrayEnumerate(distance))
FROM ArrayTest;
Eventually I want to perform this select and then join it with the table contents to only return rows that have exists = 1, but I need this first step to work before that. Am I using arrayExists wrong? What I find more confusing is that when I change the comparison value to 2, I get all 1s back. Can this kind of filtering be achieved using the array functions?
Thanks
You can use arrayExists in the WHERE clause.
SELECT *
FROM ArrayTest
WHERE arrayExists(x -> x > 7, distance) = 1;
Another way is to use ARRAY JOIN, if you need to know which values are greater than 7:
SELECT d, distance, sessionSecond
FROM ArrayTest
ARRAY JOIN distance as d
WHERE d > 7
I think the reason why you get 3 zeros is that arrayEnumerate enumerates over the array indexes, not the array values, and since none of your rows have more than 7 elements, the arrayExists call returns 0 for all the rows.
To make this work, index into the array with the enumerated positions:
SELECT arrayExists(
    val -> distance[val] > 7,
    arrayEnumerate(distance))
FROM ArrayTest;

Displaying Max, Min, Avg across a bar chart in Tableau

I have a bar chart with the X axis as a discrete date value and the Y axis as the number of records.
e.g. x axis (Filtered Date): 1st Oct, 2nd Oct, 3rd Oct, etc.
y axis (Number of Records): 30, 4, 3, etc.
Now I have to create a table to get the Max, Min, and Avg value of 'Number of Records'.
I have written a Calculated Field as MAX([Number of Records]) to get the maximum of Number of Records (in this case 30), but I always get a value of 1.
How do I define the values to get the max, min, and avg?
Thanks,
Number of Records is an automatically calculated field that Tableau generates when importing a data source. You can right-click on it and see the definition of the calculation: 1.
As you currently have your field defined, Tableau will look for the maximum value of the column. It will always be 1, because that is the only value in that field for every record.
It sounds like you are actually trying to calculate the maximum of the sum of the number of records at your aggregation level (in your case, date). You should be able to accomplish this easily using Level of Detail (LOD) expressions or table calculations, something like the following:
WINDOW_MAX(SUM([Number of Records]))

Map IDs to matrix rows in Hadoop/MapReduce

I have data about users buying products. I want to create a binary matrix of size |users| x |products| such that the element (i,j) in the matrix is 1 iff user_i has bought product_j, else the value is 0.
Now, my data looks something like
userA, productX
userB, productY
userA, productZ
...
UserIds and productIds are all strings. My problem is how to map these IDs to row indices (for users) and column indices (for products) in the matrix.
There are over a million unique userIds and roughly 3 million productIds.
To make the problem well defined: given (user, product) input like the above, how do I convert it to something like
1,1
2,2
1,3
where userA is mapped to row 1 of the matrix, userB is mapped to row 2, productX is mapped to column 1, and so on.
Given the size of the data, I would have to use Hadoop MapReduce, but I can't think of a foolproof way of doing this efficiently.
This can be solved if we can do the following:
1. Dump unique userIds.
2. Dump unique productIds.
3. Map each unique userId in (1) to a row index.
4. Map each unique productId in (2) to a column index.
I can do (1) and (2) easily, but I am having trouble coming up with an efficient approach to (3) ((4) follows once (3) is solved).
I have a couple of solutions but they are not foolproof.
Solution 1 (naive) for step 3 above
Map all userIds and emit the same key (say "1") for all map tasks.
Have a long counter initialized to 0 in setup() of the reducer.
In the reduce(), emit the counter value along with the input userId and increment the counter by 1.
This would be very inefficient since all 100 million userIds would be processed by a single reducer.
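For illustration only, here is a rough Hadoop Streaming sketch of Solution 1 in Python (the file names and the tab-separated format are assumptions, not part of the question; the input is assumed to be the already-unique userIds from step 1):
# mapper.py: emit every userId under the same key "1" so a single reducer sees them all.
import sys
for line in sys.stdin:
    user_id = line.strip()
    if user_id:
        print("1\t" + user_id)
# reducer.py: one counter numbers the userIds as they stream in.
import sys
counter = 0
for line in sys.stdin:
    _, user_id = line.rstrip("\n").split("\t", 1)
    print(user_id + "\t" + str(counter))
    counter += 1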
Solution 2 for step 3 above
While mapping userIds, emit each userId against a key which is an integer uniformly sampled from 1, 2, 3, ..., N (where N is configurable; N = 100, for example). In a way, we are partitioning the input set.
Within the mapper, use Hadoop counters to count the number of userIds assigned to each of those random partitions.
In the reducer setup, first access the counters in the mapping stage to determine how many IDs were assigned to each partition. Use these counters to determine the start and end values for that partition.
Iterate (while counting) over each userId in reduce and generate matrix rowId as start_of_partition + counter.
context.write(userId, matrixRowId)
This method should work, but I am not sure how to handle cases where reducer tasks fail or are killed.
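To make the offset arithmetic of Solution 2 concrete, here is a plain-Python sketch (no Hadoop; a hash stands in for the random sampling, and all names are illustrative):
from collections import defaultdict
import zlib

N = 4  # number of partitions (N = 100 in the question)
user_ids = ["userA", "userB", "userC", "userD", "userE"]
# "Map" side: assign each userId to a partition and count ids per partition
# (these counts play the role of the Hadoop counters).
partitions = defaultdict(list)
for uid in user_ids:
    partitions[zlib.crc32(uid.encode()) % N].append(uid)
counts = {p: len(ids) for p, ids in partitions.items()}
# "Reduce" side: a partition's start offset is the total count of all
# lower-numbered partitions; each row id is start_of_partition + local counter.
row_id = {}
for p in sorted(partitions):
    start = sum(counts[q] for q in counts if q < p)
    for local, uid in enumerate(partitions[p]):
        row_id[uid] = start + local
print(row_id)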
I believe there should be ways of doing this which I am not aware of. Can we use hashing/modulo to achieve this? How would we handle collisions at scale?

Sum Formula Crystal Reports Inquiry

OK, say I have a subreport that populates a chart from data in a table. I have a summary sum field that adds up the total of each row displayed. I am about to add two new rows that need to be displayed but not totaled up in the sum. There is a field in the table that has a number from 1-7 in it. If I added these new fields into the database, I would assign them a negative number, like -1 and -2, to differentiate them from the other records. How can I set up a formula so that it will sum up all of the amount fields except for the records that have an 'order' number (we will call it) of either -1 or -2? Thanks!
Use a Running Total Field and set the evaluate formula to something like {new_field} >= 0, so it will only sum the value when it passes that test.
The way to accomplish this without a running total is with a formula like this:
if {OrderNum} >= 0 Then {Amount}

R - Sorting and Subsetting Maximum Values within Columns

I am trying to iteratively sort data within columns to extract N maximum values.
My data is set up with the first and second columns containing occupation titles and codes, and all of the rest of the columns containing comparative values (in this case location quotients that had to be previously calculated for each city) for those occupations for various cities:
*occ_code city1 ... city300*
occ1 5 ... 7
occ2 20 ... 22
. . . .
. . . .
occ800 20 ... 25
For each city I want to sort by the maximum values and select a subset of those maximum values, matched to their respective occupation titles and codes. I thought it would be relatively trivial but...
Edit for clarification: I want to end up with a sorted subset of the data for analysis.
occ_code city1
occ200 10
occ90 8
occ20 2
occ95 1.5
At the same time I want to be able to repeat the sort column-wise (so I've tried lots of order commands by calling columns directly: data[,2]), just to be able to run the same analysis functions over the entire dataset.
I've been messing with plyr for the past 3 days and I feel like the setup of my dataset is just not conducive to how plyr was meant to be used.
I'm not exactly sure what your desired output is from your example snippet. Here's how you could get a data frame like that for every city using plyr and reshape:
#using the same df from nico's answer
library(reshape)
df.m <- melt(df, id = 1)
a.cities <- cast(df.m, Code ~ . | variable)
library(plyr)
a.cities.max <- aaply(a.cities, 1, function(x) arrange(x, desc(`(all)`))[1:4,])
Now, a.cities.max is an array of data frames, with the 4 largest values for each city in each data frame. To get one of these data frames, you can index it with
a.cities.max$X13
I don't know exactly what you'll be doing with this data, but you might want it back in data frame format.
df.cities.max <- adply(a.cities.max, 1)
One way would be to use order with ddply from the package plyr
> library(plyr)
> d<-data.frame(occu=rep(letters[1:5],2),city=rep(c('A','B'),each=5),val=1:10)
> ddply(d,.(city),function(x) x[order(x$val,decreasing=TRUE)[1:3],])
order can sort on multiple columns if you want that.
This will output the max for each city. Similar results can be obtained using sort or order
# Generate some fake data
codes <- paste("Code", 1:100, sep="")
values <- matrix(0, ncol=20, nrow=100)
for (i in 1:20)
  values[,i] <- sample(0:100, 100, replace=T)
df <- data.frame(codes, values)
names(df) <- c("Code", paste("City", 1:20, sep=""))
# Now for each city we get the maximum
maxval <- apply(df[2:21], 2, which.max)
# Output the max for each city
print(cbind(paste("City", 1:20), codes[maxval]))
