Using tbl and src_monetdblite to access data - monetdb

Sorry if this question has been asked elsewhere, I can't find it. I'm working through some basic examples in MonetDBLite.
> dbGetQuery(dbcon, "SELECT MAX(mpg) FROM mtcars WHERE cyl = 8")
L3
1 19.2
works, but
> ms <- MonetDBLite::src_monetdblite("./DB")
> t <- tbl(ms, "mtcars")
Error in UseMethod("tbl") :
no applicable method for 'tbl' applied to an object of class
"c('src_monetdb', 'src_sql', 'src')"
It seems that it's trying to assign the db to t not the table.
Any suggestions would be greatly appreciated.
I've been perusing resources and found a useR2016 presentation and noticed a difference here:
> ms
src: MonetDBEmbeddedConnection
tbls: mtcars
Curious...

I'm a huge fan of using MonetDBLite together with dplyr. My addition to Hannes Mühleisen's (thanks for the package!) answer would be that it appears that the order you load the packages can matter. Loading MonetDBLite after dplyr and dbplyr seems to be the key for me. Loading MonetDBLite first causes errors similar to the one nzgwynn noted.
Sometimes I could connect to the database with no problems. Other times I would get error messages like:
Error in UseMethod("db_query_fields") : no applicable method for 'db_query_fields' applied to an object of class "MonetDBEmbeddedConnection"
Like nzgwynn, I was puzzled about why it would work sometimes but not others. Restarting and reinstalling wouldn't necessarily fix it for me.
This clue, from an issue filed about sparklyr, lead me to explore the package loading order:
https://github.com/rstudio/sparklyr/issues/38
Like noted there with sparklyr, and I've noticed with other R database packages, MonetDBLite will load and attach automatically if the Global Environment already contains a connection object. My problem was that I had an src_monetdb object in my workspace, which was causing MonetDBLite to load upon starting RStudio. So I while I thought I was loading it after dplyr and dbplyr, it was really loading first. If I clear the workspace and then restart, I can load the packages in the preferred order. So far, this method has worked.
I've seen starting with a clean workspace advised as good practice generally, e.g.: https://twitter.com/hadleywickham/status/561146907519500288. Starting with a fresh workspace loses you no time either given MonetDBLite's speedy query ability.
Lastly, I would put a enthusiastic pitch in for using MonetDBLite. I saw it mentioned on RStudio's database page and was immediately impressed on how easy it was to setup and how fast it is. It's the best way I've found for working with a ~2 GB dataset in R. When exploring the data interactively, the dplyr queries run so quickly that it feels like I'm working with the data in memory. And if all I want to do is load the whole dataset into memory, MonetDBLite is as fast or faster than other methods I've tried like read.fst() from the fst package.

I closed R and opened it again and the same coding worked fine...

You need to call library("dplyr") before using tbl and friends. Also make sure you have dbplyr installed.
Update: Also, please make sure there is no connection object (src) in a stored workspace loaded at startup. Loading connections from .Rdata files does not work! Instead, create the connection/src from scratch every time you run a script.

Related

Lazy Logical Replication with MonetDB

I'm trying to implement MonetDB in three machines, one master and two replicas in lazy logical replication.
For now I'm trying to implement in only machine with the following commands I took from this old issue in only one machine for now.
Everything goes according to plan until the first problem I have: When trying to create tables or inserting stuff I get the following errors I was not able to find on google:
Error in optimizer wlc: TypeException:user.main[17]:'wlc.predicate' undefined in: X_0:any := wlc.predicate("alpha":str, "id":str);
Error in optimizer wlc: TypeException:user.main[50]:'wlc.predicate' undefined in: X_0:any := wlc.predicate("beta":str, "id":str);
Error in optimizer wlc: TypeException:user.main[77]:'wlc.depend' undefined in: X_0:any := wlc.depend("beta":str, X_1:lng);
I got around this by setting optpipe to minimal_pipe but I wanted to know why this is happening so I don't have to do this.
The second problem I have when trying CALL wlr.replicate:
Perhaps a missing wlr.master() call.
How do I correctly set-up replication?
Thanks in advance.
The wlc/wlr features are experimental and de facto deprecated in current releases of MonetDB and completely removed starting from the next major release. Replication in MonetDB is a topic currently under revision. You might be better off formulating a feature request on MonetDB's githup page.
You might also consider looking into the concepts of replicate and remote tables. But those are definitely not solutions by themselves and if used as such, implement replication on the SQL layer instead of the infrastructural layer.
But on the short run, I do not expect that the open source community can help you out here much. Consider commercial support otherwise if feasible.

Datastore fetch VS fetch(keys_only=True) then get_multi

I am fetching multiple entities 100+ from datastore using the below Query
return entity.query(ancestor = ancestorKey).filter(entity.year= myStartYear).order(entity.num).fetch()
Which was taking a long time (order of a few seconds) to load.
Trying to find an optimum way, I created exactly 100 entities, found that it takes anywhere between 750ms ~ 1000ms to fetch the 100 entities on local server, which is a lot of course. I am not sure how to get around a single line fetch to make it more efficient!
In a desperate attempt to optimize, I tried
Removing the order part, still got the same results
Removing the filter part, still got the same results
Removing the order & filter part, still got the same results
So apparently it is something else. In a desperate attempt, I tried fetching for keys only then passing the keys to ndb.get_multi() function:
qKeys = entity.query(ancestor = ancestorKey).filter(entity.year= myStartYear).order(entity.num).fetch(keys_only=True)
return ndb.get_multi(qKeys)
To my surprise I get a better throughput! query results now loads in 450 ~ 550ms which is around ~40% better performance on average!
I am not sure why this happens, I would have thought that the fetch function already queries entities in the most optimum time.
Question:
Any idea how I can optimize the single query line to load faster?
Side Question:
Anyone knows what's the underlying mechanism for the fetch function, and why fetching keys only, then using ndb.get_multi() is faster?
FWIW, you shouldn't expect meaningful results from datastore performance tests performed locally, using either the development server or the datastore emulator - they're just emulators, they don't have the same performance (or even the 100% equivalent functionality) as the real datastore.
Credit goes to #snakecharmerb, who correctly identified the culprit, confirmed by OP:
Be aware that performance characteristics in the cloud may differ from
those on your local machine. You really want to be running these tests
in the cloud. – snakecharmerb yesterday
#snakecharmerb you were right on your suggestion! Just tested on the
cloud it's actually the other way around on the cloud in terms of
performance. fetch() ~550ms, fetch(keysonly) then get_multi was ~700ms
seems that fetch() works better on the cloud! – Khaled yesterday

How to do Lazy Map deserialization in Haskell

Similar to this question by #Gabriel Gonzalez: How to do fast data deserialization in Haskell
I have a big Map full of Integers and Text that I serialized using Cerial. The file is about 10M.
Every time I run my program I deserialize the whole thing just so I can lookup an handful of the items. Deserialization takes about 500ms which isn't a big deal but I alway seem to like profiling on Friday.
It seems wasteful to always deserialize 100k to 1M items when I only ever need a few of them.
I tried decodeLazy and also changing the map to a Data.Map.Lazy (not really understanding how a Map can be Lazy, but ok, it's there) and this has no effect on the time except maybe it's a little slower.
I'm wondering if there's something that can be a bit smarter, only loading and decoding what's necessary. Of course a database like sqlite can be very large but it only loads what it needs to complete a query. I'd like to find something like that but without having to create a database schema.
Update
You know what would be great? Some fusion of Mongo with Sqlite. Like you could have a JSON document database using flat-file storage ... and of course someone has done it https://github.com/hamiltop/MongoLiteDB ... in Ruby :(
Thought mmap might help. Tried mmap library and segfaulted GHCI for the first time ever. No idea how can even report that bug.
Tried bytestring-mmap library and that works but no performance improvement. Just replacing this:
ser <- BL.readFile cacheFile
With this:
ser <- unsafeMMapFile cacheFile
Update 2
keyvaluehash may be just the ticket. Performance seems really good. But the API is strange and documentation is missing so it will take some experimenting.
Update 3: I'm an idiot
Clearly what I want here is not lazier deserialization of a Map. I want a key-value database and there's several options available like dvm, tokyo-cabinet and this levelDB thing I've never seen before.
Keyvaluehash looks to be a native-Haskell key-value database which I like but I still don't know about the quality. For example, you can't ask the database for a list of all keys or all values (the only real operations are readKey, writeKey and deleteKey) so if you need that then have to store it somewhere else. Another drawback is that you have to tell it a size when you create the database. I used a size of 20M so I'd have plenty of room but the actual database it created occupies 266M. No idea why since there isn't a line of documentation.
One way I've done this in the past is to just make a directory where each file is named by a serialized key. One can use unsafeinterleaveIO to "thunk" the deserialized contents of each read file, so that values are only forced on read...

executePackage seems to take a long time to launch subpackage

I am a relative beginner at SSIS so I may be doing something silly.
I have a process that involves looping over a heterogenous queue and processing the objects 1 at a time. The process is currently being done in 'set logic' and its dropping stuff. I was asked to rework it in a looping manner, so that decision has been made for me.
I have chosen to implement queue logic in 1 package and the actual processing in another package.
This is all going relatively well considering...
I now have the process up and running, but its slow. 9 seconds per item. Clearly I cant present this solution. :-)
One thing i notice, 1.5 - 2 seconds of each loop are on the ExecutePackage Task in the queue loop.
I cant figure out how to get a hard number, I am using the flashing green box method of performance tuning. The other steps seem to be very fast. Adding indexes, changing sql to sps, all the usual tricks have helped.
Is the UI realiable at all with regards to boxes turning white/yellow/green? Some tasks report times in the progress tab, some dont seem to. So I am counting yellow time.
Should calling a subpackage be that expensive? 1 change i made was I change 'RunInASeparateProcess' to FALSE. I did that because the subpackage produces the following message otherwise:
Error: 0xC0012024 at Script Task: The task "Script Task" cannot run on this edition of Integration Services. It requires a higher level edition.
Task failed: Script Task
The reading i have done seems to advocate multiple packages. Anyone have any counter patterns? Should i stay the course? I started changing to 1 package. Copy/paste doesnt seem to work well w/ SequenceContainers. I would also need to recreate all the variables in the parent package. Doable, but im not sure that is the answer.
Does anyone know of any tuning resources/websites/books they would be willing to share.
Update - I have been tearing things down in an effort to figure out what the problem is. I was thinking it was the package configurations passing variable values. I dont think that is it. I can pass variables to another package w/ nothing in it and it is fast.
I can make the trivial subpackage slow by adding the two connection managers to it.
I suddenly realize I may be making and breaking a connection to both an Oracle Server and a SQL server in both the main package and then the sub package.
Am I correct in this observation?
Is there any way I can reuse the connection between the two packages?
When i google it, most of what i see is suggestions for passing the connection string.
UPDATE - I combined the two packages into one. This performance is not about 1.25 seconds per item, down from about 9. the only thing i can point to that changed is i am now reusing a single connection instead of making multiple connections.
Thanks, I appreciate any help you are kind enough to offer.
Greg
Once you enable logging, I'd suggest running the package from a command window using dtexec. While that doesn't perfectly duplicate the server environment, it does have the advantages of (a) eliminating BIDS as a potential performance issue and (b) being something you can do without jumping through change control hoops.

Ado.net performance:What does SNIReadSync do?

We have a query that takes 2 seconds to run in Sql Server Management Studio but it takes 13 seconds to be shown on a client screen.
I used dotTrace to profile my source code and noticed there is this SNIReadSync method (part of ADO.net assemblies)that takes a lot of time to do its job(9 seconds).I ran my source over server so I could omit the network effects and the result was the same.
It doesn't matter if I'm using OleDBConnection or SqlConnection.
It doesn't matter if I'm using a DataReader or a DataSet.
Connection pooling does not solve this issue(as my result shows).
I googled this issue and I couldn't find an answer to the question that what this method is actually doing and how we can improve it.
here's what I found on StakOverFlow that's not helpful either:
https://stackoverflow.com/questions/1610874/snireadsync-executing-between-120-500-ms-for-a-simple-query-what-do-i-look-for
Ignoring SNIReadSync for a moment (I think this might be a red herring).
The symptoms you are describing sound like an incorrectly cached query plan.
Please update your statistics (or rebuild indexes) and see if it still occurs.

Resources