Strange replication in Cassandra - performance

I have configured 3 nodes locally in one 'Test Cluster' of Cassandra. When I run them and create a keyspace or table, it appears on all three nodes.
The problem I'm dealing with is that when I import millions of rows from a CSV into a table I already built, the whole data set appears on all three nodes. I have the same data replicated over the three nodes.
As far as I know, the data I'm importing should be distributed over the nodes, but only partially: one partition on the first node, the second on the third node, the third on the second node, the fourth on the first node again, and so on.
Am I right, or am I missing something big?
Also, my write speed locally is about 10k rows/second for the multi-node cluster. Isn't that a little bit too low?
I want to start a discussion so I can learn from your experience and see where I'm going wrong.
Thank you!

The number of nodes that data is written to in your cluster is determined by the Replication Factor for that keyspace. If you have 3 nodes and the data is being written to all of them, then this setting must be 3. If you only want the data to be replicated to two nodes, set this value to 2.
Your write speed will be affected by the consistency level you are specifying on the write. If you have it set to ALL then you have to wait until all the nodes that are going to write the data have written the data (in your case all 3 nodes based on your replication factor). Dropping your consistency level on the write will probably net you faster write times. There is a balance between your replication factor, write consistency level, and read consistency level that you can research further.
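To make the balance between replication factor and consistency level concrete, here is a minimal sketch in plain Python (it does not talk to Cassandra). The function names are made up for illustration, and only the ONE, QUORUM and ALL levels are modelled. It shows how many replica acknowledgements a write waits for, and the usual rule of thumb that a read sees the latest write when write acks plus read acks exceed the replication factor.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge before the request returns.

    Hypothetical helper for illustration; covers only ONE, QUORUM and ALL.
    """
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def read_sees_latest_write(write_acks: int, read_acks: int,
                           replication_factor: int) -> bool:
    """True when the read set and the write set must overlap in one replica."""
    return write_acks + read_acks > replication_factor


if __name__ == "__main__":
    rf = 3  # the replication factor implied by the question
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, "waits for", required_acks(level, rf), "of", rf, "replicas")
```

With a replication factor of 3, writing at ALL waits for all three nodes, while QUORUM waits for two and ONE for a single node, which is why lowering the write consistency level usually shortens write latency.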

Related

How to write to specific datanode in hdfs using pyspark

I have a requirement to write common data to the same HDFS data nodes, similar to how we repartition on a column in pyspark to bring similar data onto the same worker node. Even the replicas should be on the same node.
For instance, we have a file, table1.csv
Id, data
1, A
1, B
2, C
2, D
And another, table2.csv
Id, data
1, X
1, Y
2, Z
2, X1
Then datanode1 should only have (1,A),(1,B),(1,X),(1,Y)
and datanode2 should only have (2,C),(2,D),(2,Z),(2,X1)
And replication within datanodes.
It can be separate files as well based on keys. But each key should map it to a particular node.
I tried writing to HDFS with pyspark, but it assigned the datanodes randomly when I checked with hdfs fsck.
I read about setting rack IDs through the rack topology, but is there a way to select which rack to store data on?
Any help is appreciated, I'm totally stuck.
KR
Alex
I maintain that without actually exposing the underlying problem this is not going to help you, but since you technically asked for a solution, here are a few ways to do what you want. They won't actually solve the underlying problem.
If you want to shift the problem to resource starvation:
Spark setting:
spark.locality.wait - technically doesn't solve your problem, but it is likely to help you immediately, before you implement anything else I list here. This should be your go-to move before trying anything else, as it's trivial to try.
Pro: Just wait until you get a node with the data. Cheap and fast to implement.
Con: Doesn't promise to solve data locality; it just waits for a while in case the right nodes come up. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
YARN node labels
Use them to allocate your workers to specific nodes.
Pro: This should ensure at least 1 copy of the data lands within a set of worker nodes/data nodes. If subsequent jobs also use this node label you should get good data locality. Technically it doesn't specify where data is written, but as a side effect HDFS writes the first replica to the datanode the writing task runs on.
Con: You will create congestion on these nodes, or you may have to wait for other jobs to finish before you get allocated, or you may carve these nodes into a queue that no other jobs can access, reducing the functional capacity of YARN. (HDFS will still work fine.)
Use Cluster federation
Ensures data lands inside a certain set of machines.
pro: A folder can be assigned to a set of data nodes
Con: You have to allocate another name node, and although this satisfies your requirement it doesn't mean you'll get data locality. It's a great example of something that fits the requirement but might not solve the problem. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
My-Data-is-Everywhere
hadoop dfs -setrep -w 10000 /path of the file
Turn up replication, for just the folder that contains this data, until it equals the number of nodes in the cluster.
Pro: All your data nodes have the data you need.
Con: You are wasting space. Not necessarily bad, but can't really be done for large amounts of data without impeding your cluster's space.
Whack-a-mole:
Turn off datanodes, until the data is replicated where you want it. Turn all other nodes back on.
Pro: You fulfill your requirement.
Con: It's very disruptive to anyone trying to use the cluster. It doesn't guarantee that when you run your job it will be placed on the nodes with the data. Again it kinda points out how silly your requirement is.
Racking-my-brain
Someone smarter than me might be able to develop a rack strategy in your cluster that would ensure data is always written to specific nodes that you could then "hope" you are allocated to... haven't fully developed the strategy in my mind but likely some math genius could work it out.
You could also implement HBase and allocate region servers such that the data lands on the 3 servers. (This would technically fulfill your requirement, as the data would be on 3 servers and in HDFS.)

HBase Replication - Replicate data in 3 data centers

In our application we have data from 3 different countries and we persist it in HBase.
In each country, we will keep the data of all 3 countries.
To achieve this, is it possible to create our Hadoop cluster using data centers in all 3 countries and keep the replication factor at 3, so that thanks to the rack-awareness feature our data gets replicated automatically to all 3 countries?
Any pointers will be of great help.
Thanks
You can't have a single HBase cluster spanning countries. This won't work because of latency, failover problems, network issues, etc.
A good option would be to have 3 clusters, one HBase table per country, and sync the tables between clusters as proposed above.
As far as I know only Google has successfully implemented a multi-country database providing both consistency and availability: Spanner. But the key elements of the solution are a private physical network between the data centers and their own implementation of NTP, which guarantees that all servers across the world have the same clock to within a few milliseconds.
This solution looks theoretically feasible, but writes may become pretty slow as data needs to be replicated to 3 nodes located in different geographies. You would need to try it out and check whether the latency is within tolerable limits.
Another option could be to have three different HBase clusters at three locations and design the tables in such a way that tables from one HBase cluster can be copied to the others during night hours to keep the data in sync daily. In this case, an HBase cluster will have current data from its own location, but the data from the other two locations will lag by a day.

Hbase table duplication

Is there a way to duplicate table data on every node of a cluster?
I need to do a performance test with the maximum grade of locality of the data.
By default, HBase distributes data on a small fraction of the cluster nodes (on 1 or 2 nodes), maybe because my data isn't very big-data ( ~ 2 GB ).
I know that Hbase is designed for much larger data sets, but in this case, it is a requirement for me.
There are a lot of nice reads* about it (see the end of the post) but I'll try to explain it with my own words ;)
HBase is not responsible for data replication; Hadoop HDFS is. By default HDFS is configured with a replication factor of 3, which means all data will be stored on at least 3 nodes.
Data locality is key to good performance, and achieving maximum data locality is easy: you only need to colocate your HBase RegionServers (RS) with the Hadoop DataNodes (DN), so every DN should also have the RS role. Once you have that, HBase will automatically move the data where it's needed (on major compactions) to achieve data locality. That's all: as long as each RS has the data of the regions it serves locally, you'll have data locality.
Even when you have the data replicated to multiple DN, each region (and the rows it contains) will be served by just one RS; it doesn't matter whether you have a replication factor of 3, 10 or 100. Reading a row belonging to region #1 will always hit the same RS, and that will be the one that hosts the region (which will read the data locally from HDFS because of data locality). If the RS hosting that region goes down, the region will be assigned to another RS automatically (because the data is also replicated to other DN).
What you can do is split your table so that each RS has even buckets of rows (regions) assigned to it. That way as many RS as possible work simultaneously when you read or write data, increasing your overall throughput as long as you don't always hit the same regions (called regionserver hotspotting**).
Therefore, you should always start by ensuring that all the regions of your table are assigned to different RS and that they receive the same volume of R/W requests. Once you've done that, you can split your table into more regions until you have an even number of regions on all the RS of your cluster (you may need to assign them manually if you're not happy with the load balancer).
Just remember that even when you seem to have a perfect distribution of regions, you can still have poor performance if your data access pattern is not right (or is uneven) and doesn't reach all regions evenly. In the end it all depends on your application.
(*) Recommended reads:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
(**) To avoid RS hotspotting we always design our tables to have non-monotonically increasing row keys, so rows 1, 2, 3 ... N are hosted in different regions. The common approach is to use MD5(id) + id as the rowkey. This approach has its own set of drawbacks: you cannot scan the first 10 rows because they're salted.
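As a rough illustration of the MD5(id) + id rowkey, here is a minimal Python sketch. The function name and the choice of a hex-encoded digest are assumptions made for the example; the footnote does not say how the key is encoded in practice.

```python
import hashlib


def salted_rowkey(record_id: str) -> bytes:
    """Build a rowkey of the form MD5(id) + id.

    Hypothetical helper: the digest is hex-encoded here, which is only one
    possible encoding.
    """
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    return (digest + record_id).encode("utf-8")


if __name__ == "__main__":
    ids = [str(i) for i in range(1, 11)]
    keys = [salted_rowkey(i) for i in ids]
    # Sequential ids no longer sort next to each other, so consecutive
    # writes are spread over the key space instead of hitting one region.
    for key in sorted(keys):
        print(key.decode("utf-8"))
```

The original id stays readable at the end of the key, but the sort order of the keys no longer follows the ids, which is exactly why a scan over "the first 10 rows" stops being possible.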

Task scheduling with spark

I am running a fairly large task on my 4-node cluster. I am reading around 4 GB of filtered data from a single table and running Naive Bayes training and prediction. I have an HBase region server running on a single machine, separate from the Spark cluster, which runs in fair scheduling mode; HDFS is running on all machines.
While executing, I am experiencing a strange task distribution in terms of the number of active tasks on the cluster. I observed that only one or at most two active tasks are running on one or two machines at any point in time, while the others sit idle. My expectation was that the data in the RDD would be divided and processed on all the nodes for operations like count and distinct. Why are all nodes not being used for large tasks of a single job? Does having HBase on a separate machine have anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions but can make the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions) and then looping through it and printing the number of elements of each of the arrays. (This introduces communication so don't leave it in your production code.)
Many of the API calls on RDD have optional parameters for setting the number of partitions, and then there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix problems you find using the above technique (but sometimes it will expose the need to rethink your algorithm.)
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the master.
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.
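The glom() check described above could look roughly like the following sketch. The helper names are made up for illustration; partition_sizes expects a PySpark RDD and relies only on glom(), map() and collect(), while skew_summary is plain Python so it can be tried without a cluster.

```python
def partition_sizes(rdd):
    """Return the number of elements in each partition of a PySpark RDD.

    glom() turns every partition into a single list, so the length of each
    list is that partition's element count. This pulls one number per
    partition back to the driver, so keep it out of production code.
    """
    return rdd.glom().map(len).collect()


def skew_summary(sizes):
    """Summarize how evenly elements are spread across partitions."""
    total = sum(sizes)
    mean = total / len(sizes) if sizes else 0.0
    largest = max(sizes, default=0)
    return {
        "partitions": len(sizes),
        "total": total,
        "largest": largest,
        "empty": sum(1 for s in sizes if s == 0),
        # 1.0 means perfectly even; larger values mean more skew.
        "max_over_mean": (largest / mean) if mean else 0.0,
    }


if __name__ == "__main__":
    # Example sizes as they might come back from partition_sizes(rdd).
    print(skew_summary([0, 0, 100, 0]))
```

A result with many empty partitions, or a max_over_mean far above 1, suggests that repartition() or an explicit partition count on the operation that created the RDD is worth trying.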

Realizing different distribution models in hdfs?

As far as I understand from the Hadoop tutorial, it takes the overall size of the input files and divides them into blocks/chunks, and these blocks are then replicated on different nodes. However, I want to realize a data distribution model according to the requirements below:
(a) Case one: Each file is partitioned equally across the nodes in the cluster,
so that each map accesses its own partition of the table. Is it possible?
(b) Case two: Each file is fully replicated on two or more nodes, but not on all nodes,
so that each map accesses some part of the table on each node. Is it possible?
HDFS does not store tables, it stores files. Higher level projects offer 'relational tables', like Hive. Hive does allow you to partition a table stored on HDFS, see Hive Tutorial.
That being said, you should not tie partitioning to the number of nodes in the cluster. Nodes come and go, clusters grow and shrink. Partitioned relational tables partition/bucket by natural boundaries without relation to cluster size. Import, export, and daily operations all play a role in partitioning (and usually a much bigger role than cluster size). Even a single table (file) can well be spread across every node of the cluster.
If you want to tune a MR job for optimal split size/location, there are plenty of ways to do that. You still have a lot to read, you are optimizing too early.
