Create a table from a couple of summary statistics - rstudio

I'm using RStudio version 0.98.1062 on a Mac (OS X Yosemite 10.10.1).
I want to create a table (preferably one I can transfer to Excel or PDF format) from several summary statistics describing the proportion of women enrolled in different disciplines:
summary(agriculture$X2009.PROP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.3333 0.4881 0.4689 0.6026 1.0000
summary(economics$X2009.PROP)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.2555 0.3161 0.3218 0.3887 0.6923 29
summary(education$X2009.PROP)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.2967 0.5000 0.5490 0.8571 1.0000 46
summary(law$X2009.PROP)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.4250 0.5695 0.5324 0.6593 1.0000 28
Basically, I want a table that looks like this:
Discipline/SS  Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
agriculture    0.0000  0.3333   0.4881  0.4689  0.6026   1.0000
economics      0.0000  0.2555   0.3161  0.3218  0.3887   0.6923
education      ....
law            ....
Would you be so kind as to advise me how to write the code for that?

There are two basic ways you can do this: combining the data beforehand or afterwards.
Some sample data, randomly taken from the uniform distribution:
x <- runif(100)
y <- runif(100)
Combine and Summarize
If you want to combine the data beforehand, then you need to use data.frame():
d <- data.frame(variable1=x,variable2=y)
summary(d)
which will give you output like:
variable1 variable2
Min. :0.03026 Min. :0.01173
1st Qu.:0.29410 1st Qu.:0.24968
Median :0.48517 Median :0.47524
Mean :0.51137 Mean :0.47865
3rd Qu.:0.71354 3rd Qu.:0.69512
Max. :0.98465 Max. :0.980
(Note that you can also do data.frame() without specifying column names, in which case the names of the variables will be used as column names.) This might take some work to wrangle it into the format you want, but it would probably be the better format for later analyses in R. (d is now in the "wide format", from which it is not difficult to translate into the standard "long format" via packages like reshape or its successor reshape2).
As an aside, you could use cbind() (column bind) instead of data.frame(), in which case you would have a matrix instead of a data frame. For purely numerical values and simple summary statistics, this doesn't make a huge difference. I mention this only as a parallel to rbind() (see below) -- typically observations are stored in data frames rather than plain matrices (i.e. semantically richer storage).
Summarize and Combine
If you want to summarize first, you can use rbind() (row bind) to combine the individual summaries.
xs <- summary(x)
ys <- summary(y)
s <- rbind(xs,ys)
print(s)
which will give you output like this:
Min. 1st Qu. Median Mean 3rd Qu. Max.
xs 0.03026 0.2941 0.4852 0.5114 0.7135 0.9847
ys 0.01173 0.2497 0.4752 0.4787 0.6951 0.9803
From there, it should be easy enough to use the built-in functions for writing tabular data to file, see ?write.table. Excel can open both tab-separated and CSV files. If you want to go directly to PDF, then you need to take a look at exporting to LaTeX via the xtable package and/or using RMarkdown to generate a report. Printing tables with those systems is well documented elsewhere online.

Related

gnomAD database allele frequency

I need help understanding the gnomAD allele frequency.
I need to filter for variants with a population frequency below 1%. In some annotated files I have seen a "gnomAD highest frequency" column, which other scientists have used to select variants below 1%.
In my file I can only see an AF column in my gnomAD table. Also, the numbers in the AF column look like 0.9876 or 0.087; they are not in percentage form.
My question is: should I use the AF column to select variants below 1%? And do I first need to multiply the numbers in the AF column by 100 to get percentages?
Please tell me whether I am on the right path.
Thanks in advance!
The gnomAD data sets contain many different allele frequencies for different populations. The AF in your file could be any of these. If it is not documented, you should be able to work it out by comparing values with those on the gnomAD website. As for which frequency to use, it all depends on what you are trying to achieve. I use AF_popmax (maximum allele frequency across populations) to filter out common variants in the context of genomic testing for rare disease, but if you want to filter out rare variants, then the best frequency to use would depend on your specific use case. All the values are expressed as frequencies, i.e. 1 = 100%, 0.01 = 1%.
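Because the values are already frequencies, the 1% cut-off can be applied directly as 0.01 without multiplying by 100. A minimal sketch, assuming the variants have been loaded into a list of dictionaries with an "AF" key (the variant IDs and the third value are made up for illustration; the first two values are the ones quoted in the question):

```python
# 1% expressed as a frequency, matching the scale of the AF column.
MAX_AF = 0.01

# Hypothetical records; only the "AF" key name comes from the question.
variants = [
    {"id": "var1", "AF": 0.9876},
    {"id": "var2", "AF": 0.087},
    {"id": "var3", "AF": 0.0004},
]

# Keep variants seen in fewer than 1% of alleles.
rare = [v for v in variants if v["AF"] < MAX_AF]

print([v["id"] for v in rare])  # ['var3']
```

Whether AF or a per-population maximum is the right column to threshold still depends on the use case, as noted above.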

suitable formula/algorithm for detecting temperature fluctuations

I'm creating an app to monitor water quality. The temperature data is written to a Firebase real-time database every 2 minutes. The app has two requirements:
1) It should alert the user when the temperature exceeds 33 degrees or drops below 23 degrees. This part is done.
2) It should alert the user when there is a big temperature fluctuation, after analysing the data every 30 minutes. This is the part I'm confused about.
I don't know what algorithm to use to detect a big temperature fluctuation over a period of time and alert the user. Can someone help me with this?
For a period of 30 minutes, your app would give you 15 values.
One way to detect a big change in this data is as follows:
Calculate the mean and the standard deviation of the values.
Subtract the mean from each value and take the absolute value of the result.
Check whether that absolute value is greater than one standard deviation; if it is, you have a big fluctuation.
See this example for a better understanding:
Let's suppose you have these values for 10 minutes:
25, 27, 24, 35, 28
First step:
Mean = 27.8
One (population) standard deviation = 3.9 (approx.)
Second step: Absolute(Data - Mean)
abs(25 - 27.8) = 2.8
abs(27 - 27.8) = 0.8
abs(24 - 27.8) = 3.8
abs(35 - 27.8) = 7.2
abs(28 - 27.8) = 0.2
Third step:
Check whether any of the absolute differences is greater than the standard deviation.
abs(35 - 27.8) gives 7.2, which is greater than 3.9.
So there is a big fluctuation. If all the absolute differences are less than the standard deviation, then there is no big fluctuation.
You can refine the result further by using two or three standard deviations instead of one.
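The steps above can be sketched in a few lines of Python. This is only an illustration: the function name find_fluctuations and the parameter k (the number of standard deviations) are made up here, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev


def find_fluctuations(readings, k=1.0):
    """Return the readings that lie more than k population standard
    deviations away from the mean of the window."""
    centre = mean(readings)
    spread = pstdev(readings)
    return [r for r in readings if abs(r - centre) > k * spread]


window = [25, 27, 24, 35, 28]

print(find_fluctuations(window))       # [35]
print(find_fluctuations(window, k=2))  # []
```

With one standard deviation the reading of 35 is flagged; with two, nothing in this window is, which shows how the multiplier trades sensitivity against false alarms.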
Start by defining what you mean by fluctuation.
You don't say what temperature scale you're using. Fahrenheit, Celsius, Rankine, or Kelvin?
Your sampling rate is a new data value every two minutes. Do you define fluctuation as the absolute value of the difference between the last point and current value? That's defensible.
If the maximum allowable absolute value is some multiple of your 33 - 23 = 10 degrees, you're in business.
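A minimal sketch of that point-to-point definition, assuming a fluctuation means the largest absolute difference between consecutive readings. The threshold MAX_ALLOWED_JUMP is a hypothetical value that would have to be chosen for the application:

```python
def max_jump(readings):
    """Largest absolute difference between consecutive readings
    (0 if there are fewer than two readings)."""
    return max(
        (abs(b - a) for a, b in zip(readings, readings[1:])),
        default=0,
    )


# Hypothetical limit, in the same unit as the readings.
MAX_ALLOWED_JUMP = 5.0

window = [25, 27, 24, 35, 28]
largest = max_jump(window)
alert = largest > MAX_ALLOWED_JUMP

print(largest, alert)  # 11 True
```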

How to take random samples for H2O data frame in R?

I have an H2O data frame with 40 columns and 1 million rows. I want to randomly select 0.3 million rows without replacement. The h2o.sample function I found online gives this error (I have already started the H2O cluster):
Error: could not find function "h2o.sample"
Is there any other way I can do this? Thanks in advance!
There is no h2o.sample() function (maybe there was in a very old version of H2O?). You can use the h2o.splitFrame() function to split your frame into pieces. This also serves as a way to take a random subset of your data frame (without replacement). The function will actually create two (or more) pieces of your data, so if you want just the 30%, here is an example in R using iris to get a ~30% random sample of the rows:
library(h2o)
h2o.init()
hf <- as.h2o(iris)
ss <- h2o.splitFrame(hf, ratios = c(0.3), seed = 1)
sub_hf <- ss[[1]] # will contain 30% of the rows
Note that for scalability reasons, h2o.splitFrame() uses "approximate splitting", which means that you won't necessarily get exactly 30% of the rows. However, the expected value is 30%, and the result will be closer to the desired percentage when your data is bigger. iris is a tiny 150-row dataset, so there is more variance.
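To see why an approximate split gets closer to the target ratio on bigger data, here is a plain-Python sketch of one common way to do it: keep each row independently with probability equal to the ratio. This is an assumption made for illustration, not a description of H2O's internals, and approximate_sample is a made-up name.

```python
import random


def approximate_sample(rows, ratio, seed=None):
    """Keep each row independently with probability `ratio`.
    Sampling is without replacement, but the sample size is only
    approximately ratio * number of rows."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < ratio]


small = approximate_sample(range(150), 0.3, seed=1)
large = approximate_sample(range(100_000), 0.3, seed=1)

# The fraction kept is noisier for 150 rows than for 100,000 rows.
print(len(small) / 150)
print(len(large) / 100_000)
```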

Suitable machine learning algorithm for column selection

I am new to machine learning. For my work I need a machine learning algorithm that selects some columns out of many in a 2D matrix, depending on the spread of the data. Below is a sample of the 2D matrix:
400 700 4 1400
410 710 4 1500
416 716 4 1811
..............
410 710 4 1300
Previously I used the standard deviation to select columns against some threshold values (as a measure of the spread of the data in a particular column). Observe that the 3rd column is constant and the last column varies tremendously. The 1st and 2nd columns also vary, but the spread of their data is small. Computing the standard deviation of each column, I get (sigma) = 10, 10, 0, 200 respectively.
I have chosen some experimental threshold values to discard columns. If the (sigma) falls outside the threshold range, the corresponding column is discarded. I calculated those threshold values manually. Although this method is very simple, dealing with the threshold values is very tedious because there are many columns.
For this reason I want to use a standard machine learning algorithm, or somehow make these threshold values adaptive, so that I don't need to hard-code them. Can anyone please suggest an appropriate algorithm for this?
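For reference, the fixed-threshold approach described above can be sketched as follows. It uses only the four rows visible in the sample (the elided rows are left out, so the sigmas differ from the 10, 10, 0, 200 quoted above), and the bounds low and high are hypothetical stand-ins for the hand-picked thresholds:

```python
from statistics import pstdev


def select_columns(matrix, low, high):
    """Return the indices of columns whose population standard
    deviation lies within [low, high], plus all the sigmas."""
    columns = list(zip(*matrix))  # transpose rows into columns
    sigmas = [pstdev(col) for col in columns]
    kept = [i for i, s in enumerate(sigmas) if low <= s <= high]
    return kept, sigmas


rows = [
    [400, 700, 4, 1400],
    [410, 710, 4, 1500],
    [416, 716, 4, 1811],
    [410, 710, 4, 1300],
]

# Hypothetical bounds: drop constant columns and very noisy ones.
kept, sigmas = select_columns(rows, low=1.0, high=100.0)

print(kept)  # [0, 1]
```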

How can I do time/hours arithmetic in Google Spreadsheet?

How do I do time/hour arithmetic in a Google spreadsheet?
I have a value that is time (e.g., 36:00:00) and I want to divide it by another time (e.g., 3:00:00) and get 12. If I divide just one by the other, I get 288:00:00 when what I want is 12 (or 12:00:00).
Note that using the hours() function doesn't work, because 36:00:00 becomes 12.
When the number being returned by your formula is being formatted as a time, and you want it formatted as a plain number, change the format of the cell to a plain number format: click the cell and then click Format, Number, Normal.
Time values in Google spreadsheet are represented as days and parts of days. For example, 36:00:00 is the formatted representation of the number 1.5 (a day and a half).
Suppose you divide 36:00:00 by 3:00:00, as in your example. Google Spreadsheet performs the calculation 1.5 divided by 0.125, which is 12. The result tells you that you have 12 3-hour intervals in a 36-hour time period. 12, of course, is not a time interval. It is a unitless quantity.
Going the other way, it is possible to format any number as a time. If you format 12 as a time, it's reasonable to expect that you will get 288:00:00. 12 days contain 288 hours.
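The same arithmetic can be reproduced outside the spreadsheet. The sketch below uses Python's timedelta purely to illustrate the "fraction of a day" representation; it is not how Google Sheets is implemented, and as_days is a made-up helper.

```python
from datetime import timedelta

ONE_DAY = timedelta(days=1)


def as_days(duration):
    """Express a duration as a number of days, the way a
    spreadsheet stores a time value."""
    return duration / ONE_DAY


total = timedelta(hours=36)
slot = timedelta(hours=3)

print(as_days(total))                   # 1.5
print(as_days(slot))                    # 0.125
print(as_days(total) / as_days(slot))   # 12.0

# Going the other way: the plain number 12 read as days is 288 hours.
print(timedelta(days=12) / timedelta(hours=1))  # 288.0
```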
Google Sheets now has a duration formatting option. Select: Format -> Number -> Duration.
Example of calculating time:
work-start work-stop lunchbreak effective time
07:30:00 17:00:00 1.5 8 [=((A2-A1)*24)-A3]
If you subtract one time value from another, the result represents a fraction of 24 hours, so if you multiply the result by 24 you get the value in hours.
In other words: the operation is a multiplication, but its purpose is to change the unit of the number (from days to hours).
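The worked example above (07:30 to 17:00 with a 1.5 hour lunch break) can be checked with a short Python sketch. The variable names are illustrative only:

```python
from datetime import timedelta

start = timedelta(hours=7, minutes=30)   # work-start 07:30
stop = timedelta(hours=17)               # work-stop 17:00
lunch_hours = 1.5                        # lunch break, already in hours

# The difference as a fraction of a day, as the spreadsheet stores it.
fraction_of_day = (stop - start) / timedelta(days=1)

# Multiply by 24 to convert days to hours, then subtract the break.
effective_hours = fraction_of_day * 24 - lunch_hours

print(round(effective_hours, 6))  # 8.0
```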
You can use the function TIME(h,m,s) of google spreadsheet. If you want to add times to each other (or other arithmetic operations), you can specify either a cell, or a call to TIME, for each input of the formula.
For example:
B3 = 10:45
C3 = 20 (minutes)
D3 = 15 (minutes)
E3 = 8 (hours)
F3 = B3+time(E3,C3+D3,0) equals 19:20
I had a similar issue and fixed it, for now, like this:
Format each of the cells as Time.
Format the total cell (the sum of all the times) as Duration.
I used the TO_PURE_NUMBER() function and it worked.
So much simpler: look at this
B2: 23:00
C2: 1:37
D2: = C2 - B2 + ( B2 > C2 )
Why it works: a time is a fraction of a day, and the comparison B2>C2
returns TRUE (1) or FALSE (0). If it is true, 1 day (24 hours) is added.
http://www.excelforum.com/excel-general/471757-calculating-time-difference-over-midnight.html
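The same trick, written out in Python with times as fractions of a day. This is a sketch for illustration; hours_between is a made-up name, and the inputs mirror B2 = 23:00 and C2 = 1:37 from the example:

```python
def hours_between(start_fraction, end_fraction):
    """Elapsed hours from start to end, where both are fractions of a
    day. If the end is earlier than the start, the interval is assumed
    to cross midnight and one full day is added."""
    crosses_midnight = start_fraction > end_fraction  # True counts as 1
    diff = end_fraction - start_fraction + crosses_midnight
    return diff * 24


b2 = 23 / 24             # 23:00
c2 = (1 + 37 / 60) / 24  # 01:37

# 2 hours 37 minutes, i.e. about 2.62 hours.
print(hours_between(b2, c2))
```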
If you have a duration in h:mm, the actual value stored in that cell is the time converted to a real number of hours, divided by 24 hours per day.
Example: 6:45, or 6 hours 45 minutes, is 6.75 hours, and 6.75 hours / 24 = 0.28125 (in other words, 6 hours 45 minutes is 28.125% of a day). If you use a column to convert your durations into actual numbers (in this example, converting 6:45 into 0.28125), then you can do your multiplication or division and get the correct answer.
If you want to format the value within a formula (for example, if you are concatenating strings and values), the aforementioned Format option is not available, but you can use the TEXT function:
=TEXT(B1-C1,"HH:MM:SS")
So, for the example in the question, with concatenation (using "[h]" for elapsed hours, because "HH" wraps around at 24 hours, and leaving the ratio as a plain number because it is not a time):
="The number of " & TEXT(B1,"[h]") & " hour slots in " & TEXT(C1,"[h]") & " is " & C1/B1
Cheers
In a fresh spreadsheet with 36:00:00 entered in A1 and 3:00:00 entered in B1:
=A1/B1
entered in, say, C1 returns 12.
Type the values in single cells, because Google Spreadsheet can't handle duration formats at all, in any way, shape or form. Or you have to learn to write scripts and graduate as a chopper pilot. That is also an option.
