Writing arbitrary R objects to SQLite database - rsqlite

I'm trying to store large list objects created in R in an SQLite database via RSQLite. Since these list objects contain several 2D and 3D matrices, I'd like to store them as individual entries. I read that serializing them and storing them as blobs does the trick.
The problem, however, is that my code does not store each blob as an individual row; instead it stores each separate byte as its own row. Here is my code:
library(RSQLite)
out1 <- serialize(model1,NULL)
out2 <- serialize(model2,NULL)
out3 <- serialize(model3,NULL)
model4 <- serialize(rnorm(10),NULL)
model5 <- serialize(rnorm(20),NULL)
model6 <- serialize(rnorm(30),NULL)
db <- dbConnect(SQLite(), dbname="Test.sqlite")
dbGetQuery(conn = db,
"CREATE TABLE IF NOT EXISTS models
(_id INTEGER PRIMARY KEY AUTOINCREMENT,
model BLOB)")
test4 <- data.frame(g=I(model4))
test5 <- data.frame(g=I(model5))
test6 <- data.frame(g=I(model6))
dbGetPreparedQuery(db, "INSERT INTO models (model) values (:g)", bind.data=test4)
dbGetPreparedQuery(db, "INSERT INTO models (model) values (:g)", bind.data=test5)
dbGetPreparedQuery(db, "INSERT INTO models (model) values (:g)", bind.data=test6)
dbListTables(db)
p1 = dbGetQuery( db,'select * from models' )
Also, while the writing process works fine in this case, it is incredibly slow with files larger than 1000kb...
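For reference, the usual fix appears to be wrapping each raw vector in a list, so the whole vector is bound as a single BLOB value instead of being expanded byte by byte. A minimal sketch, assuming a reasonably recent RSQLite/DBI:
library(RSQLite)
db <- dbConnect(SQLite(), dbname = "Test.sqlite")
dbExecute(db, "CREATE TABLE IF NOT EXISTS models
               (_id INTEGER PRIMARY KEY AUTOINCREMENT, model BLOB)")
model4 <- serialize(rnorm(10), NULL)
# Wrapping the raw vector in a list makes the bound value a single BLOB,
# so this insert produces one row for the whole object.
dbExecute(db, "INSERT INTO models (model) VALUES (:g)",
          params = list(g = list(model4)))
# Reading back: each cell of the model column is a raw vector to unserialize()
p1 <- dbGetQuery(db, "SELECT * FROM models")
restored <- unserialize(p1$model[[1]])
dbDisconnect(db)
The older dbGetPreparedQuery() interface should accept the same idea via bind.data = data.frame(g = I(list(model4))) rather than I(model4).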

Related

Shiny app - aggregate data set by selection_filter and create new variables

I am quite new to R and have been trying to find a solution to my problem for weeks. I hope someone can help me.
1. I want to develop a Shiny app in a dashboard, where the user can select values via selection filters (e.g. the value "40-49 years" from the variable "age group" and the value "female" from "sex"). Based on these selections, columns (e.g. columns x, y, and z) from the original dataset will be aggregated. I already wrote a function using aggregate().
2. Based on the aggregated columns, new values shall be calculated (e.g. d = (x-y)/(z/2)).
3. The aggregated columns and the newly calculated values shall be displayed in a table to the user.
The function from 1)
aggreg.function <- function(a,b,c) {
  agg.data <- aggregate(cbind(x,y,z), shared_Cervix, sum,
                        subset=c(!AgeGroup %in% a & !Sex %in% b & !Edition %in% c))
  # Calculate new values
  agg.data$d <- agg.data$x + agg.data$y
  agg.data$f <- (agg.data$x + agg.data$y)/(agg.data$z/2)
  agg.data  # return the aggregated data so the reactive below can use it
}
user_data <- reactive({
  aggreg.function(input$AgeGroup, input$Sex, input$Edition)
})
EDIT
Thanks for the recommendations. I changed my code, but now I struggle a bit with adding new columns. In total, I want to insert 17 new columns based on the filtered table (data_step2()). Is there a way to insert multiple columns at the same time? In my example: is it possible to combine data_step3 and data_step4?
ui <- fluidPage(
  selectInput("Age", "Age:", sort(unique(Complete_test$Age))),
  selectInput("Race", "Race:", sort(unique(Complete_test$Race))),
  selectInput("Stage", "Stage:", sort(unique(Complete_test$Stage))),
  selectInput("Grade", "Grade:", sort(unique(Complete_test$Grade))),
  selectInput("Edition", "Edition:", sort(unique(Complete$Edition))),
  DT::dataTableOutput("filtered.result")
)
server = function(input, output) {
  data_step1 <- reactive({
    Complete %>% filter(Age %in% input$Age & Stage %in% input$Stage & Grade %in% input$Grade & Race %in% input$Race & Edition %in% input$Edition)
  })
  data_step2 <- reactive({
    data_step1() %>% group_by(Age, Stage, Grade, Race, Edition, Year) %>% summarise(across(everything(), sum))
  })
  # is it possible to combine data_step3 and data_step4?
  data_step3 <- reactive({
    data_step2() %>% mutate(xy = x + y)
  })
  data_step4 <- reactive({
    data_step3() %>% mutate(w = xy/(x2))
  })
  output$filtered.result <- DT::renderDataTable({
    data_step4()
  })
}
shinyApp(ui, server)
Here's something you might be able to do for step 1 of your question, although it's hard to tell what your data might look like and what your end result should look like. I'm assuming that you want to allow a user of your dashboard to select the AgeGroup and Sex variables to view data, so in the UI of the application, two selectInput functions are used to provide that functionality. On the server side, two reactive statements filter the data as the user changes the inputs: each one requires an input and then filters the dataset by it. Notice that in the second reactive statement, "data_step1()" is used instead of "data_step1"; this keeps the chain reactive, so the result continues to update as the user changes the inputs.
For steps 2 and 3, you'll want to use the "data_step2()" dataset to add the new columns, and then you can use a function such as renderDataTable to display the output on your dashboard (a sketch of combining data_step3 and data_step4 into a single step follows the code below).
library(shiny)
library(dplyr)
ui <- fluidPage(
  selectInput("AgeGroup", "AgeGroup:", sort(unique(Data$AgeGroup))),
  selectInput("Sex", "Sex:", sort(unique(Data$Sex)))
)
server <- function(input, output, session){
  data_step1 <- reactive({
    req(input$AgeGroup)                           # Require input
    Data %>% filter(AgeGroup %in% input$AgeGroup) # Filter full dataset by AgeGroup input
  })
  data_step2 <- reactive({
    req(input$Sex)                                # Require input
    data_step1() %>% filter(Sex %in% input$Sex)   # Filter the AgeGroup-filtered data by Sex input
  })
  # Do next steps with the filtered data_step2() dataset...
}
# Run the application
shinyApp(ui = ui, server = server)
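Regarding the EDIT: a single mutate() call can create several columns at once, and later columns may refer to ones defined earlier in the same call, so data_step3 and data_step4 can be collapsed into one reactive. A minimal sketch, keeping the column names exactly as they appear in the question:
# Sketch only: one reactive adding several derived columns in one mutate() call
data_step3 <- reactive({
  data_step2() %>%
    mutate(
      xy = x + y,      # first derived column
      w  = xy / (x2)   # later columns may use xy; "x2" kept as written in the question
      # ... the remaining derived columns (up to all 17) could be added here
    )
})
output$filtered.result <- DT::renderDataTable({
  data_step3()
})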

How do I add specific values of columns to create new columns?

I have a dataset which I want to format in order to perform a repeated measures ANOVA. My dataset is of the form:
set.seed(32)
library(tibble)
id<- rep(1:2,each=3)
y_0 <- rep(rnorm(2,mean=50,sd=10),each=3)
time <- rep(c(1,2,3),times=2)
c <- rep(rnorm(2, mean=10, sd=12), each=3)
data <- tibble(id, y_0, time, c)
I want to bring the dataset into the wide form used for repeated measures ANOVA, meaning only one row per id, and create 3 more columns: y_1 for y+c at time 1, y_2 for y+c at time 2, and y_3 for y+c at time 3. Can anyone provide some assistance?
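One possible approach (a sketch assuming dplyr/tidyr are acceptable, and using the column names from the code above) is to compute y_0 + c and spread it over the time points with pivot_wider():
library(dplyr)
library(tidyr)
# One row per id; columns y_1, y_2, y_3 hold y_0 + c for time 1, 2, 3
wide <- data %>%
  mutate(value = y_0 + c) %>%
  select(id, time, value) %>%
  pivot_wider(names_from = time, values_from = value, names_prefix = "y_")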

How to understand part and partition of ClickHouse?

I see that ClickHouse creates multiple directories for each partition key.
The documentation says the directory name format is: partition name, minimum data block number, maximum data block number, and chunk level. For example, a directory name is 201901_1_11_1.
I think it means that the directory is a part which belongs to partition 201901, contains blocks 1 to 11 and is on level 1. So we can have another part whose directory is named 201901_12_21_1, which means this part belongs to partition 201901, contains blocks 12 to 21 and is on level 1.
So I think a partition is split into different parts.
Am I right?
Parts are pieces of a table that store rows. One part = one folder with column files.
Partitions are virtual entities. They don't have a physical representation, but you can say that certain parts belong to the same partition.
SELECT does not care about partitions.
SELECT is not aware of partitioning keys, because each part has special files, minmax_{PARTITIONING_KEY_COLUMN}.idx.
These files contain the min and max values of those columns in that part.
These minmax_ values are also kept in memory, in the (C++ vector) list of parts.
create table X (A Int64, B Date, K Int64,C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor checked the in-memory part list and found 0 parts, because minmax_A.idx = (1, 1) and this SELECT needed (555, 555).
CH does not store partitioning key values.
So, for example, toYYYYMM(today()) = 202002, but this 202002 is not stored in the part or anywhere else.
minmax_B.idx stores (18302, 18302) (2020-02-10 == select toInt16(today()))
In my case, I had used groupArray() and arrayEnumerate() for ranking in POPULATE. I thought POPULATE would run the query over the new data per partition (in my case: toStartOfDay(Date)); the total sum of the newly inserted data is correct, but the groupArray() result is not.
I think this happens because, when one part is inserted, CH applies groupArray() and the ranking to each part immediately and only then merges the parts into one partition, so I don't get the final result of groupArray() and arrayEnumerate() over the whole partition.
In summary, merging
[groupArray(part_1) + groupArray(part_2)] is different from
groupArray(Partition)
with
Partition = part_1 + part_2
The workaround I tried was to insert the new data as a single block, using groupArray() to reduce the new data to fewer rows than max_insert_block_size = 1048576. That produced the correct result, but it is hard to insert one day of new data as a single part, because populating one day (almost 150-200 million rows) uses too much memory for the query.
But do you have another solution for POPULATE with groupArray() on newly inserted data, such as forcing CH to apply POPULATE per partition rather than per part, after all the parts are merged into one partition?

Sending Items to specific partitions

I'm looking for a way to send structures to pre-determined partitions so that they can be used by another RDD.
Let's say I have two RDDs of key-value pairs
val a:RDD[(Int, Foo)]
val b:RDD[(Int, Foo)]
val aStructure = a.reduceByKey(//reduce into large data structure)
b.mapPartitions{
iter =>
val usefulItem = aStructure(samePartitionKey)
iter.map(//process iterator)
}
How could I go about setting up the partitioning such that the specific data structure I need will be present for the mapPartitions call, but without the extra overhead of sending over all values (which would happen if I were to make it a broadcast variable)?
One thought I have had is to store the objects in HDFS, but I'm not sure whether that would be a suboptimal solution.
Another thought I am currently exploring is whether there is some way I can create a custom partition or Partitioner that could hold the data structure (although that might get too complicated and become problematic).
Thank you for your help!
Edit:
Pangea makes a very good point that I should offer some more specifics. Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large.
My hope is to do a MapPartitions within the RDD of vectors where I can compare each vector to the inverted index. The issue is that I only NEED one inverted index object per partition and doing a join would cause me to have a lot of copies of that index.
val vectors:RDD[(Int, SparseVector)]
val invertedIndexes:RDD[(Int, InvIndex)] = a.reduceByKey(generateInvertedIndex)
vectors.mapPartitions{
  iter =>
    val invIndex = invertedIndexes(samePartitionKey)
    iter.map(invIndex.calculateSimilarity(_))
}
A Partitioner is a function that, given a key, returns the partition it belongs to. It also decides the number of partitions.
There's a form of reduceByKey that takes a partitioner as an argument.
If I understand your question correctly, you want the data to be partitioned while doing the reduce.
See the example:
// create example data
val a =sc.parallelize(List( (1,1),(1,2), (2,3),(2,4) ) )
// create simple sample partitioner - 2 partitions, one for odd
// one for even key.hashCode. You should put your partitioning logic here
val p = new Partitioner { def numPartitions: Int = 2; def getPartition(key:Any) = key.hashCode % 2 }
// your reduceByKey function. Sample: just add
val f = (a:Int,b:Int) => a + b
val rdd = a.reduceByKey(p, f)
// here your rdd will be partitioned the way you want with the number
// of partitions you want
rdd.partitions.size
res8: Int = 2
rdd.map( ... ) // go on with your processing

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main source data and the rest are dimension data tables. I am trying to filter the main data source relation (or alias) with respect to some value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at least 20 places where I need to match some value between multiple data sources and produce a new relation, but I am getting an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I also tried the "relation::field" approach, to no avail. Alternatively, I am joining these two relations (data sources) to get the filtered data, but I feel this will slow down execution and dump unnecessarily huge amounts of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like the following in SQL, so that I can avoid JOIN statements and get it done from the FILTER statement itself:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList
