I am working with a large dataset in RStudio, with 21 columns of data, each filled with information from many time points (roughly 92 rows). I can work out the mean for each column individually, but I am really struggling to calculate all the means at once and thus produce a table of 21 mean results. Is there a way of doing this? I'm wondering if part of the problem is that my columns have alphanumeric titles.
(Apologies if it's really easy, I just don't seem to be getting anywhere with it).
Thanks in advance!
If your data frame is named df, you can use
colMeans(df)
You can easily store that in a data frame itself as
means_df <- data.frame(colMeans(df))
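One caveat worth adding (an assumption, since the original data isn't shown): alphanumeric column titles are not a problem, but colMeans() does require every column to be numeric, so any non-numeric columns would have to be dropped first, for example:
colMeans(df[sapply(df, is.numeric)])  # keep only the numeric columns before averaging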
There are many ways. If you want to do it with base R, use this
apply(df,2,mean)
If you want to use the dplyr package, try this
library(dplyr)
df %>% summarize_each(funs(mean))
(In current dplyr versions summarize_each() and funs() are deprecated; the equivalent is df %>% summarise(across(everything(), mean)).)
If you want to use the data.table package, it will be
library(data.table)
dt <- data.table(df)
dt[, lapply(.SD, mean)]
Data
df <- data.frame(A=rnorm(100),B=runif(100),C=1:100)
We have a giant file which we repartitioned according to one column; for example, say it is STATE. Now it seems that after repartitioning, the data can no longer be sorted completely. We are trying to save our final output as a text file, but instead of the first state listed being Alabama, California now shows up first. orderBy doesn't seem to have any effect when run after the repartition.
df = df.repartition(100, ['STATE_NAME']) \
       .sortWithinPartitions('STATE_NAME', 'CUSTOMER_ID', 'ROW_ID')
I can't find a clear statement in the documentation about this, only this hint for pyspark.sql.DataFrame.repartition:
The resulting DataFrame is hash partitioned.
Obviously, repartition doesn't put the rows into any particular (namely alphabetical) order, not even if they were ordered previously; it only groups them. That .sortWithinPartitions imposes no global order is no surprise considering the name, which implies that the sorting only occurs within the partitions, not across them. You can try .sort instead.
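A minimal sketch of that suggestion, assuming the column names from the question, a Spark 2.x DataFrame API and a hypothetical output path; a global sort() does its own range-partitioning shuffle, so the explicit hash repartition is not needed when the goal is a globally ordered output:
# sort()/orderBy() range-partitions the data itself, so the part files
# come out in global order (the Alabama rows land in the first part file)
df_sorted = df.sort('STATE_NAME', 'CUSTOMER_ID', 'ROW_ID')
df_sorted.write.csv('/tmp/states_sorted')  # CSV shown here instead of plain text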
I have a Google spreadsheet which matches submodules to products. Below is what the sheet looks like.
What I want to do is transform this data into a matrix like the one below.
Can I do it with one function? I want to use the result in the MMULT function, so a matrix without labels would be better.
I can't figure out any approach to get through it. I hope to get some good clues. Thanks a lot.
Not quite the result you show from the data sample provided, but I think at least close to what you want:
=query(A2:C5,"select B, sum(C) group by B pivot A")
To get a 0 in the pivot table I added a row in the source data.
I have a serialized blob and a function that converts it into a Java Map.
I have registered the function as a UDF and tried to use it in Spark SQL as follows:
sqlCtx.udf.register("blobToMap", Utils.blobToMap)
val df = sqlCtx.sql(""" SELECT mp['c1'] as c1, mp['c2'] as c2 FROM
(SELECT *, blobToMap(payload) AS mp FROM t1) a """)
I do succeed in doing it, but for some reason the very heavy blobToMap function runs twice for every row, and in reality I extract 20 fields, so it runs 20 times for every row. I saw the suggestions in "Derive multiple columns from a single column in a Spark DataFrame",
but they are really not scalable: I don't want to create a class every time I need to extract data.
How can I force Spark to do what's reasonable?
I tried to separate it into two stages. The only thing that worked was to cache the inner select, but that's not feasible either because it is a really big blob and I need only a few dozen fields from it.
I'll answer myself, hoping it will help someone. After dozens of experiments I was able to force Spark to evaluate the UDF and turn it into a Map once, instead of recalculating it over and over again for every key request, by splitting the query and doing an evil, ugly trick: converting the DataFrame to an RDD and back to a DataFrame:
val df1 = sqlCtx.sql("SELECT *, blobToMap(payload) AS mp FROM t1")
// the RDD round-trip stops Catalyst from collapsing the projections, so mp is computed once per row
sqlCtx.createDataFrame(df1.rdd, df1.schema).registerTempTable("t1_with_mp")
val final_df = sqlCtx.sql("SELECT mp['c1'] as c1, mp['c2'] as c2 FROM t1_with_mp")
I am currently using DSUM to calculate some totals and I noticed Excel has become really slow (it needs about 2 seconds per cell change).
This is the situation:
- I am trying to calculate 112 DSUMs to show in a chart;
- all DSUMs are queries on a table with 15 columns and 32k+ rows;
- all DSUMs have multiple criteria (5-6 constraints);
- the criteria use both numerical and alphanumerical constraints;
- I have the source table/range sorted;
- the Excel file is 3.4 MB in size.
(I am using Excel 2007 on a 4-year-old Windows laptop.)
Any ideas on what can be done to make it faster?
...other than reducing the number of DSUMs :P (already working on that one).
Thanks!
Some options are:
Change Calculation to Manual and press F9 whenever you want to calculate
Try SUMIFS rather than DSUM
Exploit the fact that the data is sorted by using MATCH and COUNTIF to find the first row and count of rows, then use OFFSET or INDEX to get the relevant subset of data to feed to SUMIFS for the remaining constraints
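A rough sketch of that last option, with hypothetical ranges and criteria (none of these come from the question): assume the sorted key column is A2:A32001, the values to add up are in F2:F32001, two further criteria columns are B and C, the lookup key sits in J1, and H1/H2 are helper cells.
In H1 (first row of the block for the key in J1):
=MATCH($J$1, $A$2:$A$32001, 0)
In H2 (number of rows in that block):
=COUNTIF($A$2:$A$32001, $J$1)
Then sum over just that block, applying the remaining constraints with SUMIFS:
=SUMIFS(OFFSET($F$2, H1-1, 0, H2, 1), OFFSET($B$2, H1-1, 0, H2, 1), ">=10", OFFSET($C$2, H1-1, 0, H2, 1), "West")
Because each SUMIFS then only scans one block instead of all 32k rows, the 112 formulas should recalculate much faster; OFFSET is volatile, though, so keeping the MATCH/COUNTIF results in helper cells limits the recalculation cost.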
Instead of DSUMs you could also put it all in one or more pivot tables and then use GETPIVOTDATA to extract the data you need. Reading the table will take a bit of time (though 32k rows should be done in under a second), but after that GETPIVOTDATA is lightning fast!
Downsides:
You need to manually refresh the pivot when you get new data
The pivot(s) need to be laid out so that the requested data is shown
File size will increase (unless the pivot cache is not stored, in which case file loading takes longer instead)
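A hypothetical example of pulling a value out of such a pivot table (the data field "Amount", the fields "Region"/"Year", the items "West"/2015 and the pivot anchor cell $A$3 are all made up for illustration):
=GETPIVOTDATA("Amount", $A$3, "Region", "West", "Year", 2015)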
I'm using JasperReports and iReport 4.0, and I want to know if there is a way to create a table where I can fix both the rows and the columns, because the only table component I have found lets me fix just the columns!
Also, for the charts I have only integer values, but I don't know why the scale uses float numbers!
Update:
What I mean is that iReport allows this format:
and I want the following format:
Thank you
Typically you have a varying number of rows, because the number of rows depends on the data from your database.
To have a known number of rows you either have to make sure that your data has the expected number of rows, or you design your detail section in a way that corresponds to your desired outcome. The height of the detail section is flexible, and you can put various text fields not only side by side, but also on top of each other.