The Cassandra 1.1 documentation says that it is now possible to use secondary indexes to get a slice of rows for Hadoop processing. Does this mean it is now achievable with RandomPartitioner, unlike earlier versions where OrderedPartitioner was required? However, going through the ColumnFamilyInputFormat code, I still see an assertion that enforces an OrderedPartitioner for this to happen. Any ideas on this?
I'm trying to set up MonetDB on three machines: one master and two replicas using lazy logical replication.
For now I'm trying it on only one machine, using the commands I took from this old issue.
Everything goes according to plan until the first problem: when trying to create tables or insert data, I get the following errors, which I was not able to find on Google:
Error in optimizer wlc: TypeException:user.main[17]:'wlc.predicate' undefined in: X_0:any := wlc.predicate("alpha":str, "id":str);
Error in optimizer wlc: TypeException:user.main[50]:'wlc.predicate' undefined in: X_0:any := wlc.predicate("beta":str, "id":str);
Error in optimizer wlc: TypeException:user.main[77]:'wlc.depend' undefined in: X_0:any := wlc.depend("beta":str, X_1:lng);
I got around this by setting optpipe to minimal_pipe, but I would like to know why this happens so I don't have to do that.
The second problem appears when I try CALL wlr.replicate, which fails with:
Perhaps a missing wlr.master() call.
How do I correctly set up replication?
Thanks in advance.
The wlc/wlr features are experimental, de facto deprecated in current releases of MonetDB, and completely removed starting from the next major release. Replication in MonetDB is a topic currently under revision. You might be better off filing a feature request on MonetDB's GitHub page.
You might also consider looking into the concepts of replica and remote tables. But those are definitely not solutions by themselves; if used as such, they implement replication at the SQL layer instead of the infrastructural layer.
In the short run, though, I do not expect the open source community can help you out much here. Consider commercial support otherwise, if feasible.
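To make the remote/replica-table idea above concrete, here is a rough SQL sketch. The table definitions, host names, and the mapi URL are made-up placeholders, and the syntax is from memory; check the MonetDB documentation before relying on it.

```sql
-- On the master database: an ordinary table holding the data.
CREATE TABLE t (id INT, v VARCHAR(32));

-- On a replica database: declare the master's table as a remote table.
-- 'mapi:monetdb://master-host:50000/masterdb' is a placeholder URL.
CREATE REMOTE TABLE t (id INT, v VARCHAR(32))
    ON 'mapi:monetdb://master-host:50000/masterdb';

-- A replica table treats its member tables as identical copies,
-- letting the optimizer read from any of them.
CREATE REPLICA TABLE rt (id INT, v VARCHAR(32));
ALTER TABLE rt ADD TABLE t;
```

Note that keeping the member tables actually identical is your responsibility here, which is why this is SQL-layer replication rather than a real replication mechanism.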
We have a 30-node production cluster. We want to add 5 data nodes for additional storage to handle an interim spike of data (around 2 TB). This data is to be stored temporarily, and we want to get rid of it after 15 days.
Is it possible to make sure that the interim data (2 TB) coming in will be stored only on the newly added data nodes?
I am looking for something similar to YARN node labelling.
Thank you in advance.
Unfortunately, I don't know a simple way to achieve this within the same HDFS cluster.
But I think you can achieve this behavior by implementing a custom "Block Placement Policy".
However, doing so can be somewhat risky and complex.
Here is the HDFS JIRA ticket where this functionality was defined/added, which allows you to customize the policy (JIRA TICKET).
You can read about the current datanode-selection behavior here, to understand it better if you want to customize it:
link 1
Also, here is a post with several references that can be useful on how to implement a custom policy and the risks of doing so:
post
Other readings that I recommend if you want to take this way:
link 2
post 2
This is a good paper about an experiment with a custom block placement policy that places replicas on SSD or HDD (hybrid cluster):
paper
I think that, if possible, it would be simpler to use a second cluster. E.g. you can evaluate ViewFS, which uses namespaces to reference each cluster:
viewFs reference
link 3
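To sketch the ViewFS route: the client-side mount table below maps one path to the main cluster and one to the interim cluster. The mount-table name "ClusterX", the paths, and the namenode addresses are placeholders for this example.

```xml
<configuration>
  <!-- Make ViewFS the default file system; "ClusterX" names the mount table. -->
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://ClusterX</value>
  </property>
  <!-- Permanent data stays on the main cluster. -->
  <property>
    <name>fs.viewfs.mounttable.ClusterX.link./data</name>
    <value>hdfs://main-nn:8020/data</value>
  </property>
  <!-- The interim 2 TB lands on the second cluster and can be dropped wholesale after 15 days. -->
  <property>
    <name>fs.viewfs.mounttable.ClusterX.link./interim</name>
    <value>hdfs://interim-nn:8020/interim</value>
  </property>
</configuration>
```

With this in core-site.xml, clients write to /interim without knowing it lives on a different cluster.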
Regards,
I am new to Sphinx and want to make it index a 2-million-row table (~1.5 GB in size). I will use plain indexes.
At the moment, I don't know how much memory I should set in the mem_limit config. My idea is to simply keep the default and then watch how many results are swapped (stay on disk) or expired (how often frequently used results in memory go to disk).
I'm not sure exactly how Sphinx works yet, but this is my understanding for now. How can I see stats like these, just as we can see the STATS for Memcached?
Having some kind of stats would definitely help me tune Sphinx better for my application.
In case it's relevant, I use MariaDB and PHP on CentOS.
In case it's not clear: mem_limit is ONLY used by the indexer program, i.e. while building the index.
... frankly, the setting isn't all that critical. Just set it as high as you have available memory.
It's not applicable to searchd, which actually answers queries.
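For reference, a minimal sphinx.conf sketch showing where mem_limit lives (the value here is just an example, not a recommendation):

```
indexer
{
    # Used only by the indexer while building the index;
    # searchd ignores this setting entirely.
    mem_limit = 1024M
}
```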
There is a 'SHOW STATUS' command, but it doesn't really report anything about memory:
http://sphinxsearch.com/docs/current.html#sphinxql-show-status
... memory usage (and there are no variables to control it!) can be obtained from general OS commands. On Linux, for example, possibly something like memstat.
I am using HBase, and I suspect that my rowkey design has caused hotspotting. Before trying to salt the rowkey, I would like to check whether hotspotting has already occurred. Is there any way in HBase to analyze the data distribution across region servers to check for hotspotting?
Thanks,
Partha
You can use the HMaster Info Web UI to detect this.
It should be http://master-address:16010 by default.
If it's not available, check whether the UI is disabled in the conf (hbase-site.xml) and make sure that hbase.master.info.port is not set to -1.
Once you are on it, click on the table you want to check.
You will land on a page like this one:
https://docs.prediction.io/images/cloudformation/hbase-32538c47.png
Then if you see that one region server has a lot more regions than the others, that is a good hint that this region server is probably hotspotted.
It means that the regions in that part of the rowkey range are split more often! Requests per second can also be an indicator, but in my experience it is not always very accurate.
But these are just hints; the only simple, reliable way I know to confirm that a hotspot is occurring is to benchmark it, because when it happens the write performance is really, REALLY different. So measure the throughput you get with a hashed rowkey on the same data and compare. You'll see very quickly whether there is a hotspot.
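If it does turn out to be a hotspot, the usual fix mentioned in the question is salting the rowkey. A minimal Java sketch of the idea (the bucket count, key format, and helper names are my own choices for illustration, not an HBase API):

```java
public class SaltedRowKey {
    // Number of salt buckets; typically chosen close to the number of regions.
    static final int BUCKETS = 16;

    // Prefix the original key with a stable two-digit salt derived from its
    // hash, so sequential keys are spread across different regions.
    static String salt(String rowKey) {
        int bucket = Math.floorMod(rowKey.hashCode(), BUCKETS);
        return String.format("%02d-%s", bucket, rowKey);
    }

    public static void main(String[] args) {
        // The salt is deterministic, so point reads can recompute it from the key.
        System.out.println(SaltedRowKey.salt("user123"));
    }
}
```

The trade-off is that range scans over the original key order now need to fan out over all buckets, which is why it's worth confirming the hotspot first.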
I need some guidance related to the following scenario in infinispan. Here is my scenario:
1) I created and successfully started two Infinispan nodes in client-server mode.
2) In the Hot Rod client I created a RemoteCacheManager and then obtained a RemoteCache.
3) In the remote cache I put a value like this: cache.put(key, new HashMap()); it is added successfully.
4) Now when I try to remove this value using cache.remove(key), I see that it is not removed and the HashMap is still there every time I check.
How can I clear the value so that it is removed from all nodes of the cluster?
How can I also propagate the changes like adding or removing from the value HashMap above?
Has it anything to do with implementing DeltaAware and Delta interface?
Please explain this concept or suggest some pointers where I can learn more.
Thank you
Removal of the HashMap should work as long as you use the same key and have equals() and hashCode() correctly implemented on the key. I assume you're using distributed or replicated mode.
EDIT: I've realized that equals() and hashCode() are not that important for RemoteCache, since the key is serialized anyway and all the comparison will be executed on the underlying byte[].
Remote cache does not directly support DeltaAware. Generally, using these is quite tricky even in library mode.
If you want to use the cache with maps, I suggest using a composite key like cache-key#map-key rather than storing a complex HashMap.
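To illustrate the composite-key idea (the separator and helper name are my own, not an Infinispan API): instead of storing one HashMap under a single key, each map entry becomes its own cache entry, so adding or removing a single entry is an ordinary put/remove that propagates across the cluster like any other write.

```java
public class CompositeKeys {
    // Build a flat cache key from the logical cache key and the map key.
    // Assumes '#' never appears in the cache key itself.
    static String compositeKey(String cacheKey, String mapKey) {
        return cacheKey + "#" + mapKey;
    }

    public static void main(String[] args) {
        // Instead of cache.put("user:42", map) with map = {email: ..., name: ...},
        // store each entry under its own key:
        String emailKey = CompositeKeys.compositeKey("user:42", "email");
        String nameKey  = CompositeKeys.compositeKey("user:42", "name");
        // cache.put(emailKey, "a@b.c");  // independent write
        // cache.remove(nameKey);         // independent removal, no read-modify-write
        System.out.println(emailKey + " " + nameKey);
    }
}
```

Listing all entries of one logical map then requires a key-prefix query or keeping an index entry, which is the main cost of this layout.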