I am using SpatialHadoop to store and index a dataset with 87 million points. I then apply various range queries.
I tested three different cluster configurations: 1, 2, and 4 nodes.
Unfortunately, I don't see the runtime decrease as the number of nodes grows.
Any ideas why there is no horizontal-scaling effect?
How big is your file in megabytes? Even with 87 million points, it can still be small enough that Hadoop creates only one or two splits from it.
If this is the case, you can try reducing the block size in your HDFS configuration so that the file will be split into several blocks.
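If the job uses a standard FileInputFormat-based input, you can also force more splits without rewriting the file by capping the split size on the job itself; whether SpatialHadoop's own input formats honor this setting is worth verifying. A minimal sketch (the input path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "range-query");
        FileInputFormat.addInputPath(job, new Path("/data/points"));   // placeholder input path
        // Cap each split at 32 MB so even a modest file yields several map tasks.
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        // ... set mapper/reducer classes and submit as usual ...
    }
}
```

Note that lowering dfs.blocksize only affects files written after the change, so the dataset would need to be reloaded to benefit from a smaller block size.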
Another possibility is that you are running virtual nodes on the same physical machine, which means you do not get a truly distributed environment.
I am attempting to dump over 10 billion records into HBase, which will grow on average by 10 million per day, and then attempt a full table scan over the records. I understand that a full scan over HDFS will be faster than HBase.
HBase is being used to order the disparate data on HDFS. The application is being built using Spark.
The data is bulk-loaded into HBase. Because of the various 2 GB limits, the region size was reduced to 1.2 GB from an initial test of 3 GB (this still requires more detailed investigation).
Scan caching is set to 1000 and cache blocks are off.
Total HBase size is in the 6 TB range, yielding several thousand regions across 5 region servers (nodes), whereas the recommendation is in the low hundreds.
The Spark job essentially runs over each row and then computes something based on columns within a range.
Using spark-on-hbase, which internally uses TableInputFormat, the job ran in about 7.5 hours.
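For reference, this is roughly how that read is wired up (a simplified sketch; the table name and scan settings below are placeholders mirroring the setup above, not the production values):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseScanJob {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("hbase-scan"));

        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "my_table");      // placeholder table name
        conf.set(TableInputFormat.SCAN_CACHEDROWS, "1000");      // scan caching = 1000
        conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false");    // cache blocks off

        // One Spark partition per region; each record is (row key, Result).
        JavaPairRDD<ImmutableBytesWritable, Result> rows =
            jsc.newAPIHadoopRDD(conf, TableInputFormat.class,
                                ImmutableBytesWritable.class, Result.class);

        // ... per-row computation over the column range goes here ...
        jsc.stop();
    }
}
```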
To bypass the region servers, I created a snapshot and used TableSnapshotInputFormat instead. The job completed in about 5.5 hours.
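And the snapshot-based variant, again as a sketch with placeholder names: TableSnapshotInputFormat.setInput points the job at the snapshot and a restore directory, after which the RDD is created the same way.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseSnapshotScanJob {
    public static void main(String[] args) throws Exception {
        JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("snapshot-scan"));

        Job job = Job.getInstance(HBaseConfiguration.create());
        // The restore dir must live on the same filesystem as the HBase root dir.
        TableSnapshotInputFormat.setInput(job, "my_snapshot",            // placeholder snapshot name
                                          new Path("/tmp/snapshot_restore"));

        JavaPairRDD<ImmutableBytesWritable, Result> rows =
            jsc.newAPIHadoopRDD(job.getConfiguration(), TableSnapshotInputFormat.class,
                                ImmutableBytesWritable.class, Result.class);

        // ... same per-row computation as the TableInputFormat job ...
        jsc.stop();
    }
}
```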
Questions
When reading from HBase into Spark, the regions seem to dictate the Spark partitions and thus the 2 GB limit, hence the problems with caching. Does this imply that the region size needs to be small?
TableSnapshotInputFormat, which bypasses the region servers and reads directly from the snapshots, also creates its splits by region, so it would still run into the region-size problem above. It is possible to read key-values from HFiles directly, in which case the split size is determined by the HDFS block size. Is there an implementation of a scanner or other utility which can read a row directly from an HFile (specifically, from a snapshot-referenced HFile)?
Are there any other pointers, say configurations, that may help boost performance? For instance, the HDFS block size? The main use case is a full table scan for the most part.
As it turns out, this was actually pretty fast. Performance analysis showed that the problem lay in one of the object representations for an IP address: InetAddress took a significant amount of time to resolve an IP address. We switched to using the raw bytes to extract whatever we needed. This alone made the job finish in about 2.5 hours.
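To illustrate the kind of change (a simplified sketch, not the actual project code): instead of round-tripping through InetAddress, the needed fields can be pulled straight out of the 4 raw IPv4 bytes.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public final class IpBytes {

    // Fast path: operate directly on the 4 raw IPv4 bytes already stored in the cell.
    // Example extraction: pack the first three octets (the /24 prefix) into an int.
    public static int prefix24(byte[] raw) {
        return ((raw[0] & 0xFF) << 16) | ((raw[1] & 0xFF) << 8) | (raw[2] & 0xFF);
    }

    // Slow path for comparison: going through InetAddress parses and validates the
    // string form and allocates objects on every call.
    public static byte[] viaInetAddress(String ip) throws UnknownHostException {
        return InetAddress.getByName(ip).getAddress();
    }
}
```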
Modelling the problem as a MapReduce job and running it on MR2 with the same change showed that it could finish in about 1 hour 20 minutes.
The iterative nature and smaller memory footprint helped MR2 achieve more parallelism, and hence it was much faster.
I am loading 20 million non-expiring entries into JBoss Data Grid using Hot Rod clients. My Hot Rod clients are running on 5 different machines to load the data. The entries were added successfully. We have set a replication factor of 2, so there will be a total of 40 million entries in the grid. We found a variation of more than 10% in the number of entries added to each node. For example, one node has 7.8 million entries while another has 12 million.
So I am wondering why the entries are not equally distributed; ideally, each node should have about 10 million entries. The objective of the above test was to check whether the load/requests are being distributed equally across all the nodes.
Any pointers on how the key/value pairs are distributed in JDG would be appreciated.
In Infinispan, the hash space is divided into segments, which then get mapped to the nodes in the cluster.
Entries are hashed by applying the MurmurHash3 function to their keys; this determines the segment which owns each key. It is possible that your keys are causing a somewhat uneven distribution. You could try increasing the number of segments in your configuration; with your cluster size, use at least 100 segments.
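A minimal sketch of the programmatic configuration, assuming a distributed cache with 2 owners as described in your setup (if you configure declaratively, the equivalent is the segments attribute on the distributed-cache element):

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class SegmentsConfig {
    public static Configuration build() {
        return new ConfigurationBuilder()
            .clustering()
                .cacheMode(CacheMode.DIST_SYNC)  // distributed cache
                .hash()
                    .numOwners(2)                // replication factor of 2, as in the question
                    .numSegments(100)            // more segments smooths out the distribution
            .build();
    }
}
```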
Also, I had to look up the meaning of "crore" and "lakh", as I had no idea what they were. You should probably use the 10M and 100K notation instead to make it easier to understand.
We have a setup of 1 master and 2 slave nodes. The data is set up in both Postgres and HBase, and it is a similar dataset (the same number of rows): 65 million. Yet we don't find a measurable increase in performance from HBase for the same query.
My first thought is: does HBase use the compute capacity of all nodes to fan the query out? Perhaps this is why the performance is not measurably better.
Are there any other reasons why the performance of Postgres and HBase would be about the same? Any specific configuration items to look for?
EDIT: Something I found while researching this: http://www.flurry.com/2012/06/12/137492485#.VaQP_5QpBpg
This is kind of a yes-and-no answer. Depending on what you are doing for your 'query' and your region distribution, you may or may not be using all the nodes. For example, if you are running a scan across the table, it will run against each region (assuming more than one) in sequence. However, if you are using a multi-get for keys that are in different regions, this will run in parallel.
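As a rough illustration of the multi-get path (the table name and keys below are placeholders): batching the Gets into one call lets the client group them by region and issue the per-region requests in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetExample {
    public static Result[] fetch(List<String> keys) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("my_table"))) {  // placeholder table
            List<Get> gets = new ArrayList<>();
            for (String key : keys) {
                gets.add(new Get(Bytes.toBytes(key)));
            }
            // One batched call; Gets spanning different regions are fanned out in parallel.
            return table.get(gets);
        }
    }
}
```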
The real benefit is going to come as the number of regions increases and you start parallelizing requests (multiple clients). Regions will be distributed across region servers by the Master as regions are split.
We are Hadoop newbies. We realize that Hadoop is for processing big data and that a Cartesian product is extremely expensive. However, we are running some experiments where we execute a Cartesian product job similar to the one in the MapReduce Design Patterns book, except with a reducer calculating the average of all intermediate results (including only the upper half of A*B, so the total is A*B/2).
Our setting: a 3-node cluster, block size = 64 MB; we tested different data set sizes ranging from 5,000 points (130 KB) to 10,000 points (260 KB).
Observations:
1- All map tasks are running on one node, sometimes on the master machine, other times on one of the slaves, but they are never processed on more than one machine. Is there a way to force Hadoop to distribute the splits, and therefore the map tasks, among machines? Based on what factors does Hadoop decide which machine is going to process the map tasks (in our case it once chose the master, and in another case a slave)?
2- In all cases where we test the same job on different data sizes, we get 4 map tasks. Where does the number 4 come from? Since our data size is less than the block size, why do we have 4 splits and not 1?
3- Is there a way to see more information about the exact splits for a running job?
Thanks in advance
What version of Hadoop are you using? I am going to assume a later version that uses YARN.
1) Hadoop should distribute the map tasks among your cluster automatically and not favor any specific nodes. It will place a map task as close to the data as possible, i.e. it will choose a NodeManager on the same host as a DataNode hosting a block. If such a NodeManager isn't available, then it will just pick a node to run your task. This means you should see all of your slave nodes running tasks when your job is launched. There may be other factors blocking Hadoop from using a node, such as the NodeManager being down, or not enough memory to start up a JVM on a specific node.
2) Is your file size slightly above 64MB? Even one byte over 67,108,864 bytes will create two splits. The CartesianInputFormat first computes the cross product of all the blocks in your data set. Having a file that is two blocks will create four splits -- A1xB1, A1xB2, A2xB1, A2xB2. Try a smaller file and see if you are still getting four splits.
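A quick way to check, sketched here against a placeholder path: compare the file's length with its HDFS block size and see how many blocks it actually occupies.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockCountCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/points.txt")); // placeholder path
        long length = status.getLen();
        long blockSize = status.getBlockSize();
        long blocks = (length + blockSize - 1) / blockSize;  // ceiling division
        System.out.printf("length=%d bytes, blockSize=%d bytes, blocks=%d%n",
                          length, blockSize, blocks);
    }
}
```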
3) You can see the running job in the UI of your ResourceManager. http://resourcemanager-host:8088 will open the main page (jobtracker-host:50030 for MRv1) and you can navigate to your running job from there, which will let you see the individual tasks that are running. If you want more specifics on what the input format is doing, add some log statements to the CartesianInputFormat's getSplits method and re-run your code to see what is going on.
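A complementary option (a hedged sketch; the key/value types are placeholders and should match your job): besides logging inside getSplits, each map task can log the split it was handed, which then shows up in that task's log in the UI.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SplitLoggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Printed to the task's stderr log, viewable per task in the ResourceManager UI.
        System.err.println("Processing split: " + context.getInputSplit());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... the actual map logic for the Cartesian-product job goes here ...
    }
}
```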
I am facing a strange problem due to Hadoop's crazy data distribution and management. One or two of my data nodes are completely filled up due to non-DFS usage, whereas the others are almost empty. Is there a way I can make the non-DFS usage more uniform?
[I have already tried using dfs.datanode.du.reserved but that doesn't help either]
An example of the problem: I have 16 data nodes with 10 GB of space each. Initially, each of the nodes has approximately 7 GB of free space. When I start a job processing 5 GB of data (with replication factor = 1), I expect the job to complete successfully. But alas, when I monitor the job execution, I see that one node suddenly runs out of space because its non-DFS usage is approximately 6-7 GB; the task then retries and another node runs out of space. I don't really want to rely on higher retry counts because that won't give the performance metric I am looking for.
Any idea how I can fix this issue?
It sounds like your input isn't being split up properly. You may want to choose a different InputFormat or write your own to better fit your data set. Also make sure that all your nodes are listed in your NameNode's slaves file.
Another problem can be serious data skew: the case where a large part of the data goes to one reducer. You may need to create your own partitioner to solve it.
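A minimal sketch of a skew-aware partitioner (the types and the "HOT_" prefix are placeholders; note that fanning one key out over several reducers changes grouping semantics, so it only fits jobs whose reduce side can merge partial results):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int base = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        if (key.toString().startsWith("HOT_")) {
            // Salt known hot keys with the value so their records fan out across reducers.
            return ((base + value.get()) & Integer.MAX_VALUE) % numPartitions;
        }
        return base;
    }
}
```

It is registered on the job with job.setPartitionerClass(SkewAwarePartitioner.class).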
You cannot restrict non-DFS usage, as far as I know. I would suggest identifying exactly which input file (or which of its splits) causes the problem. Then you will probably be able to find a solution.
Hadoop MR is built under the assumption that processing a single split can be done using a single node's resources, such as RAM or disk space.