Convert each multilinestring to one linestring only - geopandas

In this shapefile, the geometry column is linestring apart from 4 stream reaches (8168547, 8171738, 8170616, 8169920) that are multilinestring.
I need to convert each multilinestring to a single linestring.
I have tried many things, but none worked. For example, I tried st_cast in the sf package in R. However, it increased the number of rows (it converts each multilinestring to several linestrings).
How can I convert each multilinestring to one linestring only?

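In geopandas, a multilinestring can be split into its component linestrings with explode (note that, like st_cast, this yields one row per part):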
import geopandas as gpd
gdf = gpd.read_file(filepath)
exploded = gdf.explode()

The {sf} way of converting multilinestrings to linestrings would be, as you mention, via sf::st_cast().
But there is a problem with your data: some of the streams cannot be made into simple linestrings. A linestring must have a single start point and a single end point, and this is simply not possible for some of your rchids. As a result, some of your objects end up being duplicated.
As this is a general failure, and not an R-specific one, I would expect the comment to be valid for geopandas as well, although I have not run the code to verify.
I suggest first casting your object to linestrings, then identifying the duplicates and filtering them out.
library(sf)
library(dplyr)
streams <- st_read("tukituki_rivStrah3.shp") %>%
  select(-length) %>% # dropping length, as it is a derived metric
  st_cast("LINESTRING")
duplicities <- streams %>%
  st_drop_geometry() %>%
  group_by(rchid) %>%
  tally() %>%
  filter(n > 1) %>%
  pull(rchid)
# this will not do...
mapview::mapview(streams[streams$rchid == duplicities[2],])
clean_streams <- streams %>%
  filter(!rchid %in% duplicities)
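The point about a linestring needing a single start and a single end can be illustrated without any GIS library. The following is a minimal sketch in plain Python, assuming each part of a multilinestring is given as a list of coordinate tuples; the function name chain_parts is hypothetical and the greedy joining is only an illustration, not what st_cast or geopandas do internally. On real geometries, shapely.ops.linemerge performs a comparable merge.

```python
def chain_parts(parts):
    """Try to join line parts end to end into one coordinate list.

    Returns the merged coordinates, or None when the parts cannot form
    a single chain (for example because of a branch or a gap).
    """
    remaining = [list(p) for p in parts]
    chain = remaining.pop(0)
    progress = True
    while remaining and progress:
        progress = False
        for idx, part in enumerate(remaining):
            if part[0] == chain[-1]:        # part continues the tail
                chain = chain + part[1:]
            elif part[-1] == chain[-1]:     # reversed part continues the tail
                chain = chain + part[::-1][1:]
            elif part[-1] == chain[0]:      # part leads into the head
                chain = part[:-1] + chain
            elif part[0] == chain[0]:       # reversed part leads into the head
                chain = part[::-1][:-1] + chain
            else:
                continue
            remaining.pop(idx)
            progress = True
            break
    return chain if not remaining else None


# Two parts that share an endpoint can become one linestring.
connected = [[(0, 0), (1, 0)], [(1, 0), (2, 0)]]
# A third part branching off at (1, 0) cannot be attached to either end.
branched = connected + [[(1, 0), (1, 1)]]

print(chain_parts(connected))
print(chain_parts(branched))
```

A reach for which this kind of check returns None is one that cannot be represented as a single linestring, which matches the duplicated rchids described above.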

Related

How to create a Matrix with p values from anova

I performed an ANOVA and corrected it with Tukey's test, so I got several p-values.
Now I would like to build a heatmap with these values, and for that I need to create a matrix of the p-values.
The first question is: how do I fill a matrix with the ANOVA p-values?
Then I ran an ANCOVA and obtained other p-values.
Now I would like to make a heatmap to compare the p-values between the ANOVA and the ANCOVA.
Can someone help me?
I will give an example:
anova_model <- aov( X ~ groups , data = T1)
postHocs <- glht(anova_model, linfct = mcp(groups = "Tukey"))
summary(postHocs)
This ANOVA gave me several p-values.
ancova_model <- aov( X ~ groups + age , data = T1)
postHocs <- glht(ancova_model, linfct = mcp(groups = "Tukey"))
summary(postHocs)
This ANCOVA gave me several other p-values.
I would now like to create a heatmap to compare these p-values, to see, for example, when age has a large effect and when it does not. I believe the ideal first step is to create a matrix, but I'm actually kind of lost.
Could someone help me?
Thank you very much
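The matrix-filling step can be sketched independently of the plotting. The following is a minimal illustration in plain Python, assuming the pairwise p-values have already been copied out of the Tukey summary into a dictionary keyed by comparison names of the form "B - A". The function name pvalue_matrix, the group labels and the p-values are all made up for the example, and putting 1.0 on the diagonal is just one possible convention.

```python
def pvalue_matrix(pvalues):
    """Build a symmetric matrix from pairwise p-values.

    pvalues maps comparison names such as "B - A" to a p-value.
    Returns the sorted group labels and a list-of-lists matrix.
    """
    groups = sorted({g for name in pvalues for g in name.split(" - ")})
    index = {g: i for i, g in enumerate(groups)}
    # 1.0 on the diagonal: a group compared with itself (a convention)
    matrix = [[1.0] * len(groups) for _ in groups]
    for name, p in pvalues.items():
        a, b = name.split(" - ")
        matrix[index[a]][index[b]] = p
        matrix[index[b]][index[a]] = p
    return groups, matrix


# Made-up p-values, for illustration only.
anova_p = {"B - A": 0.03, "C - A": 0.20, "C - B": 0.60}
labels, matrix = pvalue_matrix(anova_p)
for label, row in zip(labels, matrix):
    print(label, row)
```

Building one such matrix for the ANOVA and one for the ANCOVA gives two grids with the same layout, which can then be plotted side by side or subtracted cell by cell.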

Performing a calculation on several data frames with a for loop

I have a group of dataframes, and I want to write a for loop that performs a calculation on all of them without my having to manually enter the name of each dataframe.
example:
df1
df2
df3
#first I try to create a list of the dataframe names to iterate through
dflist <- list(c(df1, df2, df3))
Then I attempt to iterate through it including the calculation. Simplified version here:
for (i in 1:length(dflist)) {
x <- dflist[i]$columnone[1] %>%
y <- dflist[i]$columntwo[1] %>%
z <- mean(dflist[i]$columnthree) %>%
paste0("result_",i) <- x-y/z
}
I keep being told that z cannot be found.
What am I doing wrong?
(the paste0 line at the end is meant to store the result for each dataframe as its own new variable but is not the focus of the question)
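For comparison, the intended loop can be sketched in Python with plain dictionaries standing in for the data frames. This is only an illustration of the pattern (keep the tables in a named collection and store results in another collection rather than creating one variable per result); the table contents are made up and the function name first_minus_ratio is hypothetical. In the R version, the trailing %>% on each line and the use of dflist[i] rather than dflist[[i]] look like likely culprits, though that has not been tested here.

```python
from statistics import mean


def first_minus_ratio(table):
    """Compute x - y / z for one table, as in the question."""
    x = table["columnone"][0]        # first value of column one
    y = table["columntwo"][0]        # first value of column two
    z = mean(table["columnthree"])   # mean of column three
    return x - y / z


# Made-up stand-ins for df1, df2, df3.
tables = {
    "df1": {"columnone": [10, 11], "columntwo": [4, 5], "columnthree": [1, 2, 3]},
    "df2": {"columnone": [20, 21], "columntwo": [6, 7], "columnthree": [2, 4, 6]},
    "df3": {"columnone": [30, 31], "columntwo": [8, 9], "columnthree": [4, 4, 4]},
}

# One entry per table, keyed by name, instead of separate variables.
results = {f"result_{name}": first_minus_ratio(t) for name, t in tables.items()}
print(results)
```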

Error: requires numeric/complex matrix/vector arguments for %*%; cross validating glmmTMB model

I am adapting some k-fold cross-validation code written for glmer/merMod models to a glmmTMB model framework. All seems well until I try to use the output from the model(s) fit with training data to predict and exponentiate values into a matrix (to then break into quantiles/number of bins to assess predictive performance). I can get this line to work using glmer models, but when I run the same model using glmmTMB I get "Error in model.matrix: requires numeric/complex matrix/vector arguments". There are many other posts discussing this error, and I have tried converting the data frame into matrix form and changing the class of the covariates, with no luck. Running the parts before and after the %*% separately works, but when they are combined I get the error. For context, this code is intended to be run with use/availability data, so the example variables may not make sense, but they show the problem well enough. Any suggestions as to what is going on?
library(lme4)
library(glmmTMB)
# Example with mtcars dataset
data(mtcars)
# Model both with glmmTMB and lme4
m1 <- glmmTMB(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
m2 <- glmer(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
#--- K-fold code (hashed out sections are original glmer version of code where different)---
# define variables
k <- 5
mod <- m1 #m2
dt <- model.frame(mod) #data used
reg.list <- list() # initialize object to store all models used for cross validation
# finds the name of the response variable in the model dataframe
resp <- as.character(attr(terms(mod), "variables"))[attr(terms(mod), "response") + 1]
# define column called sets and populates it with character "train"
dt$sets <- "train"
# randomly selects a proportion of the "used"/am records (i.e. am = 1) for testing data
dt$sets[sample(which(dt[, resp] == 1), sum(dt[, resp] == 1)/k)] <- "test"
# updates the original model using only the subset of "trained" data
reg <- glmmTMB(formula(mod), data = subset(dt, sets == "train"), family=poisson,
control = glmmTMBControl(optimizer = optim, optArgs=list(method="BFGS")))
#reg <- glmer(formula(mod), data = subset(dt, sets == "train"), family=poisson,
# control = glmerControl(optimizer = "bobyqa", optCtrl=list(maxfun=2e5)))
reg.list[[i]] <- reg # store models
# uses new model created with training data (i.e. reg) to predict and exponentiate values
predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% glmmTMB::fixef(reg)))
#predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% lme4::fixef(reg)))
Without looking at the code too carefully: glmmTMB::fixef(reg) returns a list (with elements cond (conditional model parameters), zi (zero-inflation parameters) and disp (dispersion parameters)) rather than a vector.
If you replace this bit with glmmTMB::fixef(reg)[["cond"]] it will probably work.
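The reason a list breaks the matrix product can be shown with a small analogy in plain Python. This is a sketch only: a dictionary stands in for the list returned by fixef, the coefficient values are made up, and predict_exp is a hypothetical helper that mimics exp(X %*% beta) for a design matrix given as a list of rows.

```python
import math


def predict_exp(design_rows, coefs):
    """Return exp(row . coefs) for each row of a design matrix."""
    return [
        math.exp(sum(x * b for x, b in zip(row, coefs)))
        for row in design_rows
    ]


# Made-up stand-in for the structure returned by fixef() in glmmTMB.
fixed_effects = {"cond": [0.5, -0.1], "zi": [], "disp": []}
design = [[1.0, 0.0], [1.0, 2.0]]

# Passing the whole container fails, because it is not a numeric vector.
try:
    predict_exp(design, fixed_effects)
except TypeError as err:
    print("fails:", err)

# Selecting the conditional-model coefficients first works.
print(predict_exp(design, fixed_effects["cond"]))
```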

Eigenvalues for matrices in a for loop

I need to calculate eigenvalues of a series of matrices and then save them in a separate file. My data has 5 columns and 10,000 rows. I use the following functions:
R <- NULL
A <- setwd("c:/location of the file on this computer")
for (i in 0:1) {
  X <- read.table(file = "Example.prn", skip = i * 5, nrows = 5)
  M <- as.matrix(X)
  E <- eigen(M, only.values = TRUE)
  R <- rbind(R, E)
  print(E)
}
As an example I have used a data set with 10 rows and 5 columns. This gives me the following results:
$`values`
[1] 1.350000e+02+0.000e+00i -4.000000e+00+0.000e+00i 4.365884e-15+2.395e-15i 4.365884e-15-2.395e-15i
[5] 8.643810e-16+0.000e+00i
$vectors
NULL
$`values`
[1] 2.362320e+02+0.000000e+00i -4.960046e+01+1.258757e+01i -4.960046e+01-1.258757e+01i 9.689475e-01+0.000000e+00i
[5] 1.104994e-14+0.000000e+00i
$vectors
NULL
I have three questions and would really appreciate any help:
1. I want to save the results in consecutive rows, such as:
Eigenvalue(1) Eigenvalue(3) Eigenvalue(5) Eigenvalue(7) Eigenvalue(9)
Eigenvalue(2) Eigenvalue(4) Eigenvalue(6) Eigenvalue(8) Eigenvalue(10)
Any thoughts?
2. I don't understand the eigenvalues in the output. They do not look like plain numbers. For example, one of them is 2.362320e+02+0.000000e+00i. My first thought was that this is the sum of five determinants for a 5x5 matrix. However, "2.362320e+02+0.000000e+00i" seems to have only four numbers in it. Any thoughts? Doesn't the eigen() function calculate the final values of the eigenvalues?
3. How can I save my outcome to an Excel file? I have used the following code. However, the result I get from the current code is:
> class(R)
[1] "matrix"
> print(R)
values vectors
E Complex,5 NULL
E Complex,5 NULL
I think you can easily get the values with the following code:
R <- NULL
A <- setwd("c:/location of the file on this computer")
for (i in 0:1) {
  X <- read.table(file = "Example.prn", skip = i * 5, nrows = 5)
  M <- as.matrix(X)
  E <- eigen(M, only.values = TRUE)
  R <- rbind(R, E$values)
}
Then use the answer to this question to save R into a file.
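On the second question: a value such as 2.362320e+02+0.000000e+00i is a single complex number, written as real part plus imaginary part, so it is 236.232 with an imaginary part of zero. The idea of stacking one row of eigenvalues per matrix and writing the rows to a file that Excel can open can be sketched in plain Python as follows. This is an illustration only: the function name eigen_rows_to_csv and the tolerance are assumptions, the values are rounded from the output shown above, and Python writes the imaginary unit as j rather than i.

```python
import csv
import io


def eigen_rows_to_csv(rows, tol=1e-9):
    """Write one row of eigenvalues per matrix as CSV text.

    Values whose imaginary part is negligible (below tol) are written
    as plain real numbers; the rest stay complex.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([v.real if abs(v.imag) < tol else v for v in row])
    return buffer.getvalue()


# Eigenvalues of the two example matrices, rounded from the printed output.
rows = [
    [135 + 0j, -4 + 0j, 4.37e-15 + 2.4e-15j, 4.37e-15 - 2.4e-15j, 8.64e-16 + 0j],
    [236.232 + 0j, -49.6 + 12.59j, -49.6 - 12.59j, 0.969 + 0j, 1.1e-14 + 0j],
]
print(eigen_rows_to_csv(rows))
```

The returned text can be saved with a .csv extension and opened in Excel; entries of the order of 1e-15 are effectively zero up to floating-point rounding.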

Extrapolating variance components from Weir-Fst on Vcftools

vcftools --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2
The above command computes Fst distances on 1000 Genomes population data using Weir and Cockerham's 1984 formula. This formula uses 3 variance components, namely a, b, c (between populations; between individuals within populations; between gametes within individuals within populations).
The output directly provides the result of the formula but not the components that the program calculated to arrive at the final result. How can I ask Vcftools to output the values for a,b,c?
If you can get the data into the format for hierfstat, you can get the variance components from varcomp.glob. What I normally do is:
use vcftools with --012 to get genotypes
convert 0/1/2/-1 to hierfstat format (e.g., 11/12/22/NA)
load the data into hierfstat and compute (see below)
R example:
library(hierfstat)
data = read.table("hierfstat.txt", header=T, sep="\t")
levels = data.frame(data$popid)
loci = data[,2:ncol(data)]
res = varcomp.glob(levels=levels, loci=loci, diploid=T)
print(res$loc)
print(res$F)
Without a hierarchical design, Fst for each locus (row) is therefore the first component divided by the sum of the components in res$loc: res$loc[1]/sum(res$loc). If you have more complicated sampling, you'll need to interpret the variance components differently.
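The ratio above is the usual Weir and Cockerham estimator, Fst = a / (a + b + c), and the multi-locus estimate is the ratio of the summed components rather than the mean of per-locus ratios. A minimal sketch in plain Python, assuming the components for each locus are available as (a, b, c) tuples; the function names and the numbers are made up for illustration, and the mapping of the columns of res$loc to a, b, c should be checked against the hierfstat documentation.

```python
def weir_fst(a, b, c):
    """Per-locus Fst from the three variance components."""
    total = a + b + c
    if total == 0:
        return None  # undefined, e.g. for a monomorphic locus
    return a / total


def multilocus_fst(components):
    """Ratio-of-sums Fst over several loci, given (a, b, c) tuples."""
    numerator = sum(a for a, _, _ in components)
    denominator = sum(a + b + c for a, b, c in components)
    if denominator == 0:
        return None
    return numerator / denominator


# Made-up variance components for two loci.
loci = [(1.0, 1.0, 2.0), (1.0, 0.0, 1.0)]
print([weir_fst(a, b, c) for a, b, c in loci])
print(multilocus_fst(loci))
```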
--update per your comment--
I do this in pandas, but any language would do. It's a text replacement exercise. Just get your .012 file into a dataframe and convert as below. I read it row by row into numpy because I have tons of SNPs, but read_csv would work, too.
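import pandas as pd
import numpy as np
# assumed path: vcftools --012 writes <prefix>.012 ("out" is the default prefix)
z12_file = "out.012"
# read the genotype matrix row by row (one individual per line)
z12_data = []
with open(z12_file) as handle:
    for i, line in enumerate(handle):
        z12_data.append(np.array([int(x) for x in line.strip().split("\t")]))
        if i % 10 == 0:
            print(i)  # progress indicator
z12_df = pd.DataFrame(np.array(z12_data))
z12_df = z12_df.drop(0, axis=1)  # drop the first column (individual index)
z12_df.columns = pd.Series(z12_df.columns) - 1
# 0/1/2/-1 -> hierfstat genotype codes
hierf_trans = {0: 11, 1: 12, 2: 22, -1: 'NA'}
def apply_hierf_trans(series):
    # leave any value that is not in the translation table unchanged
    return series.map(lambda x: hierf_trans.get(x, x))
hierf = z12_df.apply(apply_hierf_trans)
hierf.to_csv("hierfstat.txt", header=True, index=False, sep="\t")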
Then, you'd read that file hierfstat.txt into R, these are your loci. You'd need to specify your levels in your sampling design (e.g., your population). Then call varcomp.glob() to get the variance components. I have a parallel version of this here if you want to use it.
Note that you are specifying 0 as the reference allele in this case. That may be what you want, or it may not. I often calculate minor allele frequency and make 2 the minor allele, but it depends on your study goal.
