I'm teaching myself the tidyverse and am working on purrr at the moment. For example, using map_dbl() to find the variance of the numeric features of a data set is straightforward; it returns a named double vector:
library(tidyverse)
diamonds %>%
select_if(is.numeric) %>%
map_dbl(var)
The result looks like this (a named double vector, values rounded):
    carat     depth     table     price         x         y         z
    0.225      2.05      4.99  15915629.     1.26      1.30     0.498
So far so good. But if I want to do anything (such as sort the results), the labels are lost. For example:
diamonds %>%
select_if(is.numeric) %>%
map_dbl(var) %>%
as_tibble() %>%
arrange(value)
This loses the labels; all that is left is an unlabeled value column:
#> # A tibble: 7 × 1
#>        value
#>        <dbl>
#> 1      0.225
#> 2      0.498
#> 3      1.26
#> 4      1.30
#> 5      2.05
#> 6      4.99
#> 7  15915629.
How is this done in the tidyverse in a way that keeps the labels?
You can do this, which keeps the data as a tibble throughout.
It's worth reading about tidy data, where each variable must have its own column and
each observation must have its own row.
library(tidyverse)
diamonds %>%
summarise(across(where(is.numeric), var)) %>%
pivot_longer(everything()) %>%
arrange(value)
#> # A tibble: 7 × 2
#> name value
#> <chr> <dbl>
#> 1 carat 0.225
#> 2 z 0.498
#> 3 x 1.26
#> 4 y 1.30
#> 5 depth 2.05
#> 6 table 4.99
#> 7 price 15915629.
Created on 2022-04-30 by the reprex package (v2.0.1)
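If you'd rather stay closer to the original purrr pipeline, tibble::enframe() is another option: it turns a named vector into a two-column tibble, so the labels survive the sort. A minimal sketch:

library(tidyverse)
diamonds %>%
  select_if(is.numeric) %>%
  map_dbl(var) %>%
  enframe() %>%   # named vector -> tibble with 'name' and 'value' columns
  arrange(value)

This gives the same tibble as the summarise/pivot_longer version above.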
I am trying to practice R by reproducing an algorithm we were given in class for quantitative analysis of system performance. The outputs are queue length (Q), throughput (X) and service time (R) for a given number of items (n) in the system and a given number of machines (k).
I started with a simplified version where the number of machines is 1, and that code works.
N1 <- c(1, 2, 3)
K1 <- 1
Q <- 0
R <- 0
D <- 3  # service rate of the machine

for (z in 1:length(N1)) {
  if (z == 1) R[z] <- D else R[z] <- 3 * (1 + Q[z - 1])
  X <- z / R[z]
  Q[z] <- X * R[z]
}
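(For reference, this loop produces R = 3, 6, 9 and Q = 1, 2, 3 across the three population sizes, with X = 1/3 throughout.)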
Then I tried it for 4 machines, where D holds the service rate of each machine, so I created a nested for loop. The code is the following.
N1 <- c(1, 2)
K <- c(1, 2, 3, 4)
D <- c(3, 4, 3, 6)
Q <- 0
R <- 0
X <- 0

for (z in 1:length(N1)) {
  for (k in 1:length(K)) {
    if (z == 1) R[k, z] <- D[k] else R[k, z] <- D[k] * (1 + Q[k, z - 1])
    X[z] <- z / sum(R[z])
    if (z == 1) Q[k, z] <- X[z] * R[k, z] else Q[k, z] <- X[z] * R[k, z]
  }
}
Although I handled the z == 1 case, I get an error saying: "Error in R[k, z] <- D[k] : incorrect number of subscripts on matrix".
I am not sure how to proceed and I would appreciate any help. Just let me know in case more details are needed. Thanks very much.
You have to allocate R as a two-dimensional matrix first (and the same goes for Q, which you also index as Q[k, z]).
Use:
R <- matrix(nrow=length(K), ncol=length(N1))
Instead of:
R <- 0
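For completeness, here is a minimal sketch of the nested loop with both matrices pre-allocated. Note it also assumes the throughput update was meant to be z / sum(R[, z]) (the column sum over all machines) rather than z / sum(R[z]):

N1 <- c(1, 2)
K <- c(1, 2, 3, 4)
D <- c(3, 4, 3, 6)  # service rate of each machine

# one row per machine, one column per population size
R <- matrix(nrow = length(K), ncol = length(N1))
Q <- matrix(nrow = length(K), ncol = length(N1))
X <- numeric(length(N1))

for (z in 1:length(N1)) {
  for (k in 1:length(K)) {
    if (z == 1) R[k, z] <- D[k] else R[k, z] <- D[k] * (1 + Q[k, z - 1])
  }
  X[z] <- z / sum(R[, z])  # throughput from the total time across machines
  Q[, z] <- X[z] * R[, z]  # queue length per machine
}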
Let's say you have to return the sum of all the multiples of 2 and 3 among the integers from 1 to 100. In Haskell, the code I would write would look something like this:
sum ([x*2 | x<-[1..100], x*2 < 100] `union` [x*3 | x<-[1..100], x*3 < 100])  -- union comes from Data.List
This uses two list comprehensions with a union. Another solution would be to step through each item in the list and evaluate it (using a modulus), then add it to a separate list whose elements you would sum at the end.
Both of these solutions come out with the same answer, but which one is more optimized if you had to do the same for, say, a list from 1..1000000?
The answer to the original question is 3317, in case you want to check your own algorithm.
If you are looking for performance, you can simplify this problem to the point where you don't even need a computer....
Numbers divisible by 2 or 3 fall into a pattern
0 (1) 2 3 4 (5).... 6 (7) 8 9 10 (11).... etc
or
TFTTTF.... TFTTTF....
Assume that the max bound is divisible by 6 (if not, you can just choose the highest such value below the real bound and add the remaining few values by hand). Let maxBound = 6*N.
Each block n, for n from 0 to N-1, contributes the values
6*n, 0, 6*n+2, 6*n+3, 6*n+4, 0 (the zeros stand for the skipped positions 6*n+1 and 6*n+5)
which sums to
24*n+9
so all you need to do is sum over the complete blocks:
sum from n=0 to N-1 of (24*n+9)
=24*(sum from n=0 to N-1 of n) + 9*N
=24*N*(N-1)/2 + 9*N
=12*N^2-3*N
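(Quick sanity check: with N = 16 the blocks cover 0 to 95, and 12*16^2 - 3*16 = 3024, which matches the inclusion-exclusion count 2256 + 1488 - 720 for the multiples of 2, plus those of 3, minus those of 6, up to 95.)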
so a very fast Haskell program that would solve this problem would look something like this
f maxBound = 12*n^2 - 3*n + remainingStuff
  where n = maxBound `quot` 6
        remainingStuff = sum $ filter (<= maxBound) [6*n, 6*n+2, 6*n+3, 6*n+4]
The union function is a "quadratic" algorithm, so using one list comprehension will be faster.
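For reference, the single-comprehension version of the original task looks like this (a sketch using mod for the divisibility test; it returns 3317, as above):

sum [x | x <- [1..99], x `mod` 2 == 0 || x `mod` 3 == 0]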
A better way which is useful for generating these kinds of sequences is to take advantage of the fact that they are ordered and merge them together with a function like:
merge :: [Int] -> [Int] -> [Int]
merge as [] = as
merge [] bs = bs
merge as@(a:at) bs@(b:bt) =
  case compare a b of
    LT -> a : merge at bs
    EQ -> a : merge at bt
    GT -> b : merge as bt
and then generate your sequence with:
[ x | x <- merge [2,4..100] [3,6..100] ]
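Putting the pieces together, a quick sketch that reuses merge for an arbitrary bound (multiples is just an illustrative name):

multiples :: Int -> Int
multiples bound = sum (merge [2,4..bound] [3,6..bound])

main :: IO ()
main = print (multiples 99)  -- 3317, matching the answer above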
One last tip for writing combinatorial loops: replace expressions like x <- [1..100], 2*x < 100 with x <- [1..49], or if you can't compute the upper bound explicitly, use x <- takeWhile (\x -> 2*x < 100) [1..100]. The latter form only generates as many items as needed.
I have two vectors of dates, A and B. For each date in A, I would like to obtain the closest date in B that is greater than or equal to the date itself
(the reason being that B is a list of working days and I need my output to be a working day).
How can I do this in a vectorised manner in R?
This should work, though there might be a fully vectorised solution (d1 and d2 play the roles of A and B here):
sapply(d1, function(x) min(d2[d2>x]))
+1 to blindJesse for the apply logic, but care needs to be given to the case where no nearest date exists, as well as to the coercion to numeric that occurs:
a <- as.Date(sample(1:20, 5, T), origin=Sys.Date())
#[1] "2012-08-26" "2012-08-31" "2012-08-25" "2012-08-18" "2012-08-20"
b <- as.Date(sample(1:20, 5, T), origin=Sys.Date())
#[1] "2012-08-27" "2012-08-27" "2012-08-25" "2012-08-22" "2012-08-17"
sapply(a, function(x) min(b[b>x]))
#[1] 15579 Inf 15579 15574 15574   # dates coerced to numeric; Inf where no later date in b exists
# generate the min index instead, catching the no-min case;
# which.min() indexes into the subset, so map it back to positions in b
min.indices <- sapply(a, function(x) {
  candidates <- which(b > x)
  if (length(candidates) == 0) NA else candidates[which.min(b[candidates])]
})
b[min.indices]
#[1] "2012-08-27" NA "2012-08-27" "2012-08-22" "2012-08-22"
Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I had a list with N elements, where each element was a vector of length X. Now suppose I wanted to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.data.frame, using the plyr package, combining do.call with cbind, pre-allocating the data.frame and filling it in, and others.
The problem that was presented was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?
Since a data.frame is already a list and you know that each list element is the same length (X), the fastest thing would probably be to just update the class and row.names attributes:
set.seed(21)
n <- 1e6
x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
x <- c(x,x,x,x,x,x)
system.time(a <- as.data.frame(x))
system.time(b <- do.call(data.frame,x))
system.time({
d <- x # Skip 'c' so Joris doesn't down-vote me! ;-)
class(d) <- "data.frame"
rownames(d) <- 1:n
names(d) <- make.unique(names(d))
})
identical(a, b) # TRUE
identical(b, d) # TRUE
Update - this is ~2x faster than creating d:
system.time({
e <- x
attr(e, "row.names") <- c(NA_integer_,n)
attr(e, "class") <- "data.frame"
attr(e, "names") <- make.names(names(e), unique=TRUE)
})
identical(d, e) # TRUE
Update 2 - I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.
set.seed(21)
f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
f <- c(f,f,f,f,f,f)
tracemem(f)
system.time({ # makes 2 copies
attr(f, "row.names") <- c(NA_integer_,n)
attr(f, "class") <- "data.frame"
attr(f, "names") <- make.names(names(f), unique=TRUE)
})
set.seed(21)
g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
g <- c(g,g,g,g,g,g)
tracemem(g)
system.time({ # only makes 1 copy
attributes(g) <- list(row.names=c(NA_integer_,n),
class="data.frame", names=make.names(names(g), unique=TRUE))
})
identical(f,g) # TRUE
This appears to need a data.table suggestion, given that efficiency for large datasets is required. Notably, setattr sets by reference and does not copy:
library(data.table)
set.seed(21)
n <- 1e6
h <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
h <- c(h,h,h,h,h,h)
tracemem(h)
system.time({h <- as.data.table(h)
setattr(h, 'names', make.names(names(h), unique=T))})
as.data.table, however, does make a copy.
Edit - a no-copy version
Using @MatthewDowle's suggestion, setattr(i, 'class', 'data.frame') converts to data.frame by reference (no copies):
set.seed(21)
n <- 1e6
i <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
i <- c(i,i,i,i,i,i)
tracemem(i)
system.time({
setattr(i, 'class', 'data.frame')
setattr(i, "row.names", c(NA_integer_,n))
setattr(i, "names", make.names(names(i), unique=TRUE))
})
I'm trying to learn Haskell, and after an article on reddit about Markov text chains, I decided to implement Markov text generation first in Python and now in Haskell. However, I noticed that my Python implementation is way faster than the Haskell version, even though Haskell is compiled to native code. I am wondering what I should do to make the Haskell code run faster; for now I believe it's so much slower because of using Data.Map instead of hashmaps, but I'm not sure.
I'll post the Python code and the Haskell as well. With the same data, Python takes around 3 seconds and Haskell is closer to 16 seconds.
It goes without saying that I'll take any constructive criticism :).
import random
import re
import cPickle
class Markov:
def __init__(self, filenames):
self.filenames = filenames
self.cache = self.train(self.readfiles())
picklefd = open("dump", "w")
cPickle.dump(self.cache, picklefd)
picklefd.close()
def train(self, text):
splitted = re.findall(r"(\w+|[.!?',])", text)
print "Total of %d splitted words" % (len(splitted))
cache = {}
for i in xrange(len(splitted)-2):
pair = (splitted[i], splitted[i+1])
followup = splitted[i+2]
if pair in cache:
if followup not in cache[pair]:
cache[pair][followup] = 1
else:
cache[pair][followup] += 1
else:
cache[pair] = {followup: 1}
return cache
def readfiles(self):
data = ""
for filename in self.filenames:
fd = open(filename)
data += fd.read()
fd.close()
return data
def concat(self, words):
sentence = ""
for word in words:
if word in "'\",?!:;.":
sentence = sentence[0:-1] + word + " "
else:
sentence += word + " "
return sentence
def pickword(self, words):
temp = [(k, words[k]) for k in words]
results = []
for (word, n) in temp:
results.append(word)
if n > 1:
for i in xrange(n-1):
results.append(word)
return random.choice(results)
def gentext(self, words):
allwords = [k for k in self.cache]
(first, second) = random.choice(filter(lambda (a,b): a.istitle(), [k for k in self.cache]))
sentence = [first, second]
        while len(sentence) < words or sentence[-1] != ".":
current = (sentence[-2], sentence[-1])
if current in self.cache:
followup = self.pickword(self.cache[current])
sentence.append(followup)
else:
print "Wasn't able to. Breaking"
break
print self.concat(sentence)
Markov(["76.txt"])
And here is the Haskell version:
module Markov
( train
, fox
) where
import Debug.Trace
import qualified Data.Map as M
import qualified System.Random as R
import qualified Data.ByteString.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train (x:y:[]) = M.empty
train (x:y:z:xs) =
    let l = train (y:z:xs)
    in M.insertWith' (\new old -> M.insertWith' (+) z 1 old) (x, y) (M.singleton z 1) l
main = do
contents <- B.readFile "76.txt"
print $ train $ B.words contents
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
a) How are you compiling it? (ghc -O2 ?)
b) Which version of GHC?
c) Data.Map is pretty efficient, but you can be tricked into lazy updates -- use insertWith', not insertWithKey (see the sketch after this list).
d) Don't convert bytestrings to String. Keep them as bytestrings, and store those in the Map.
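To illustrate point (c), here is a self-contained sketch of a strict counter insert. (In modern containers the primed functions are gone; Data.Map.Strict.insertWith plays the same role, which is what this assumes.)

import qualified Data.Map.Strict as M

-- Strict insert: the combined count is forced on insertion,
-- so repeated bumps don't accumulate a chain of (+) thunks.
bump :: Ord k => k -> M.Map k Int -> M.Map k Int
bump k = M.insertWith (+) k 1

main :: IO ()
main = print (foldr bump M.empty ["a", "b", "a"])
-- fromList [("a",2),("b",1)]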
Data.Map is designed under the assumption that Ord comparisons take constant time. For string keys this may not be the case, and when the strings are equal it is never the case. You may or may not be hitting this problem depending on how large your corpus is and how many words have common prefixes.
I'd be tempted to try a data structure that is designed to operate with sequence keys, such as, for example, the bytestring-trie package kindly suggested by Don Stewart.
I tried to avoid doing anything fancy or subtle. These are just two approaches to doing the grouping; the first emphasizes pattern matching, the second doesn't.
import Data.List (foldl')
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
type Database2 = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train2 :: [B.ByteString] -> Database2
train2 words = go words M.empty
where go (x:y:[]) m = m
go (x:y:z:xs) m = let addWord Nothing = Just $ M.singleton z 1
addWord (Just m') = Just $ M.alter inc z m'
inc Nothing = Just 1
inc (Just cnt) = Just $ cnt + 1
in go (y:z:xs) $ M.alter addWord (x,y) m
train3 :: [B.ByteString] -> Database2
train3 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
where update m (x,y,z) = M.alter (addWord z) (x,y) m
addWord word = Just . maybe (M.singleton word 1) (M.alter inc word)
inc = Just . maybe 1 (+1)
main = do contents <- B.readFile "76.txt"
let db = train3 $ B.words contents
print $ "Built a DB of " ++ show (M.size db) ++ " words"
I think they are both faster than the original version, but admittedly I only tried them against the first reasonable corpus I found.
EDIT
As per Travis Brown's very valid point, here is a version that uses insertWith instead of alter:
train4 :: [B.ByteString] -> Database2
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
where update m (x,y,z) = M.insertWith (inc z) (x,y) (M.singleton z 1) m
inc k _ = M.insertWith (+) k 1
Here's a foldl'-based version that seems to be about twice as fast as your train:
import Data.List (foldl')

train' :: [B.ByteString] -> Database
train' xs = foldl' (flip f) M.empty $ zip3 xs (tail xs) (tail $ tail xs)
where
f (a, b, c) = M.insertWith (M.unionWith (+)) (a, b) (M.singleton c 1)
I tried it on the Project Gutenberg Huckleberry Finn (which I assume is your 76.txt), and it produces the same output as your function. My timing comparison was very unscientific, but this approach is probably worth a look.
1) I'm not clear on your code.
a) You define "fox" but don't use it. Did you mean for us to help you using "fox" instead of reading the file?
b) You declare this as "module Markov" then have a 'main' in the module.
c) System.Random isn't needed. It helps us help you if you clean the code up a bit before posting.
2) Use ByteStrings and some strict operations as Don said.
3) Compile with -O2 and use -fforce-recomp to be sure you actually recompiled the code.
4) Try this slight transformation; it works very fast (0.005 seconds). Obviously the input is absurdly small, so you'd need to provide your file or just test it yourself.
{-# LANGUAGE OverloadedStrings, BangPatterns #-}
module Main where
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
type Database = M.Map (B.ByteString, B.ByteString) (M.Map B.ByteString Int)
train :: [B.ByteString] -> Database
train xs = go xs M.empty
where
go :: [B.ByteString] -> Database -> Database
go (x:y:[]) !m = m
go (x:y:z:xs) !m =
      let m' = M.insertWithKey' (\key new old -> M.insertWithKey' (\_ n o -> n + o) z 1 old) (x, y) (M.singleton z 1) m
in go (y:z:xs) m'
main = print $ train $ B.words fox
fox="The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
As Don suggested, look into using the stricter versions of your functions: insertWithKey' (and M.insertWith', since you ignore the key param the second time anyway).
It looks like your code probably builds up a lot of thunks until it gets to the end of your [String].
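A tiny self-contained illustration of that effect, using nothing beyond Data.List: foldl defers every (+) as an unevaluated thunk, while foldl' forces each partial sum as it goes.

import Data.List (foldl')

main :: IO ()
main = do
  -- foldl (+) 0 [1..10^7] would build ~10^7 nested (+) thunks before
  -- evaluating anything; foldl' forces the accumulator at each step,
  -- so this runs in constant space.
  print (foldl' (+) 0 [1 .. 10^7 :: Int])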
Check out: http://book.realworldhaskell.org/read/profiling-and-optimization.html
...especially try graphing the heap (about halfway through the chapter). I'd be interested to see what you figure out.