What can be done when one node in a Hadoop cluster is very slow? - hadoop

I have a Hadoop cluster of 5 nodes.
I have two concerns:
1) What can be done when one of the nodes is running or processing data very slowly (not stopped) compared to the other nodes?
2) I've set up log4j to capture logs, but how can I keep the logs of all nodes on the NameNode or on one main server?
Please suggest ...!
Thanks

To question one, it's not clear which service is slow... DataNode? NameNode? Maybe you need to increase the heap sizes of these processes, or the dataset you've stored is heavily skewed onto that server.
You would need to install monitoring software to capture I/O, CPU, network, etc. metrics to really diagnose any hardware bottlenecks. From there, make sure that that one server is running the latest OS patches, has the latest drivers, and has a hardware profile similar to the other machines you're comparing it against. Maybe the hard drive or NIC is failing, but without hardware diagnostic software it'd be hard to know.
For question 2, you'd again need additional software, such as Elasticsearch, to centrally collect and index your logs from many systems.
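As a concrete illustration of that second point, here is a minimal sketch of shipping one node's Hadoop log lines to a central Elasticsearch instance over its HTTP API. The hostname, index name, and log path are placeholders, and in practice a dedicated shipper such as Filebeat, Fluentd, or Logstash is the more robust choice:

```python
# Minimal log-shipping sketch: tail a Hadoop daemon log on this node and index
# every new line into a central Elasticsearch instance over HTTP.
# Hostname, index name, and log path are placeholders (assumptions).
import json
import socket
import time
import urllib.request

ES_URL = "http://es-server:9200/hadoop-logs/_doc"       # assumed central ES endpoint
LOG_PATH = "/var/log/hadoop/hadoop-hdfs-datanode.log"   # assumed log location

def ship(line: str) -> None:
    """Index a single log line, tagged with the originating host."""
    doc = {"host": socket.gethostname(), "message": line.rstrip(), "ts": time.time()}
    req = urllib.request.Request(
        ES_URL,
        data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5).read()

with open(LOG_PATH) as f:
    f.seek(0, 2)              # start at the end of the file, like `tail -f`
    while True:
        line = f.readline()
        if line:
            ship(line)
        else:
            time.sleep(1)
```

Run one copy of this (or your real shipper) on each node and you can then search every node's logs from a single place, for example with Kibana.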

Related

Running Hadoop in virtual environment

I would like to know whether I should expect problems when running a Hadoop cluster on virtual instead of physical machines.
I'm mostly worried about sharing the same hard drive; I read that I should count on 1-2 containers per drive, but in my case only one drive will exist. Could that be a problem?
I think it depends on how much memory you are allocating per container. Of course, there will be a limit on the number of containers if memory is restricted (a rough calculation is sketched after this answer).
I can highlight a few points to consider while running a Hadoop cluster in a virtual environment:
Network configuration in the case of a multi-node cluster
Obviously, the performance of the application
Effect on scalability, since resources are limited if you are planning to run the cluster on a host with low-end hardware
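To make the memory point concrete, here is a back-of-envelope estimate of how many containers a single virtualized node could run. The numbers are illustrative only; the YARN property names in the comments are the standard ones, but your values will differ:

```python
# Back-of-envelope estimate of how many YARN containers one virtualized node
# can run, given the memory handed to YARN and the per-container allocation.
# All values below are illustrative, not recommendations.
node_ram_gb = 8          # RAM of the (virtual) node
reserved_for_os_gb = 2   # OS plus DataNode/NodeManager daemons
container_gb = 2         # e.g. yarn.scheduler.minimum-allocation-mb = 2048

# Roughly what you would set yarn.nodemanager.resource.memory-mb to.
yarn_memory_gb = node_ram_gb - reserved_for_os_gb
max_containers = yarn_memory_gb // container_gb

print(f"Memory available to YARN: {yarn_memory_gb} GB -> about {max_containers} containers")
```

With only one physical drive, those few containers also share a single disk, which is usually the real bottleneck rather than the container count itself.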

File sync between n web servers in cluster

There are n nodes in a web cluster. Files may be uploaded to any node and then must be distributed to every other node. This distribution does not have to happen in a transaction (in fact it must not; distributed transactions don't scale) and some latency is acceptable, although it must be minimal. Conflicts can be resolved arbitrarily (typically last write wins) provided that the resolution is also distributed to all nodes, so that eventually all nodes have the same set of files. Nodes can be added and removed dynamically without having to reconfigure existing nodes. There must be no single point of failure and no additional boxes required to solve this (such as RabbitMQ).
I am thinking along the lines of using consul.io for dynamic configuration so that each node can refer to consul to determine what other nodes are available and writing a daemon (Golang) that monitors the relevant folders and communicates with other nodes using ZeroMQ.
Feels like I would be re-inventing the wheel though. This is a common problem and I expect there are solutions available already that I don't know about? Or perhaps my approach is wrong and there is another way to solve this?
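For reference, the discovery half of that proposed design is fairly small. Below is a hedged sketch that asks a local Consul agent for the current cluster membership via its HTTP catalog API; the agent address is the default one, and the ZeroMQ file-transfer part is left out entirely:

```python
# Discovery half of the proposed design: ask the local Consul agent which
# nodes are currently in the cluster, so the sync daemon knows where to push
# file changes. The agent address is Consul's default; no ZeroMQ here.
import json
import urllib.request

CONSUL_NODES_URL = "http://127.0.0.1:8500/v1/catalog/nodes"   # default local agent

def list_cluster_nodes():
    """Return (node_name, address) pairs for every node Consul knows about."""
    with urllib.request.urlopen(CONSUL_NODES_URL, timeout=5) as resp:
        nodes = json.load(resp)
    return [(n["Node"], n["Address"]) for n in nodes]

if __name__ == "__main__":
    for name, addr in list_cluster_nodes():
        print(f"peer {name} at {addr}")
```

Consul's blocking queries on the same endpoint would let the daemon react as nodes join or leave instead of polling.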
Yes, there has been some stuff going on with distributed synchronization lately:
You could use syncthing (open source) or BitTorrent Sync.
Syncthing is node-based, i.e. you add nodes to a cluster and choose which folders to synchronize.
BTSync is folder-based, i.e. you obtain a "secret" for a folder and can synchronize with everyone in the swarm for that folder.
From my experience, BTSync has better discovery and connectivity, but the whole synchronization process is closed source and nobody really knows what happens. Syncthing is written in Go, but sometimes has trouble discovering peers.
Both syncthing and BTSync use LAN discovery via broadcast and a tracker for discovery, AFAIK.
EDIT: Or, if you're really cool, use IPFS to host the latest version, IPNS to "name" that and mount the IPNS on the servers. You can set the IPFS bootstrap list to some of your servers, which would even make you independent of external trackers. :)

Hadoop Performance Monitoring tools for Windows

Are there any tools for monitoring performance on a Hadoop cluster on Windows? We installed Hortonworks HDP 2.2.0 on a Windows single-node cluster and tested our JAR; we were able to process 5 million records in 26 minutes. Now we have set up a cluster with 4 slave machines and 1 name node. Though the RAM of each machine is only 8 GB, we are just doing a proof of concept. We see no improvement in the processing time on the cluster. Are there any tools which point out the problem? All the available ones are written for Linux.
Thanks,
Kishore.
5 million records doesn't sound like a lot to throw at Hadoop. What's the size of your data in GB?
I don't know any Hadoop monitoring tools for Windows, but you should start with the basics: is your data splittable? Have a look at the ResourceManager's view - how many containers did you have for your MapReduce app? Were they distributed across all machines? (The Capacity Scheduler tends not to distribute the load over several machines if it can stick all of it on one.) What about CPU usage per task attempt, I/O per task attempt?
You should also store, compare, and analyze Windows performance counters - CPU, I/O, network - to see if you have any bottlenecks.
You may not need Windows-native tools to surface the kinds of performance metrics you are looking for. If you're after performance metrics from YARN, MapReduce, or HDFS, you can collect metrics from each of those technologies out of the box from a web interface/HTTP endpoint exposed by each tech in question.
With HDFS, for example, you can collect metrics from the NameNode and DataNodes via HTTP. In addition, you can access the full suite of metrics via JMX, though that option requires a little more configuration.
I wrote a guide to collecting Hadoop performance metrics with native tools which you might find useful. It details methods for collecting metrics for MapReduce, YARN, HDFS, and ZooKeeper.
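If you want to script that collection rather than click through web UIs, the daemons' built-in /jmx endpoints return the same metrics as JSON. A minimal sketch is below; the hostnames and ports (50070 for the NameNode web UI, 8088 for the ResourceManager) are the usual defaults for that era of HDP and may differ in your installation:

```python
# Pull Hadoop metrics straight from the daemons' built-in HTTP /jmx endpoints,
# which return JSON and need no extra agent. Hostnames and ports are assumed
# defaults (NameNode web UI 50070, ResourceManager 8088) and may differ.
import json
import urllib.request

ENDPOINTS = {
    "namenode": "http://namenode-host:50070/jmx",
    "resourcemanager": "http://namenode-host:8088/jmx",
}

INTERESTING_BEANS = {
    "Hadoop:service=NameNode,name=FSNamesystem",          # capacity, under-replicated blocks
    "Hadoop:service=ResourceManager,name=ClusterMetrics",  # active/lost NodeManagers
}

def fetch_beans(url):
    """Return the list of JMX MBeans a Hadoop daemon exposes over HTTP."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["beans"]

for daemon, url in ENDPOINTS.items():
    for bean in fetch_beans(url):
        if bean.get("name") in INTERESTING_BEANS:
            print(daemon, json.dumps(bean, indent=2))
```

Among other things, ClusterMetrics on the ResourceManager shows how many NodeManagers are actually active, which quickly tells you whether all 4 slaves are participating in the job at all.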

MarkLogic latency: Document not found

I am working in a clustered MarkLogic environment where we have 10 nodes. All nodes are shared E- and D-nodes.
The problem that we are facing:
When a page is written to MarkLogic, it takes some time (up to 3 seconds) for all the nodes in the cluster to get updated, and if I do a read operation during that time to fetch the previously written page, it is not found.
Has anyone experienced this latency issue and looked at eliminating it? If so, please let me know.
Thanks
It's normal for a new document to only appear after the database transaction commits. But it is not normal for a commit to take 3-sec.
Which version of MarkLogic Server?
Which OS and version?
Can you describe the hardware configuration?
How large are these documents? All other things equal, update time should be proportional to document size.
Can you reproduce this with a standalone host? That should eliminate cluster-related network latency from the transaction, which might tell you something. Possibly your cluster network has problems, or possibly one or more of the hosts has problems.
If you can reproduce the problem with a standalone host, use system monitoring to see what that host is doing at the time. On linux I favor something like iostat -Mxz 5 and top, but other tools can also help. The problem could be disk I/O - though it would have to be really slow to result in 3-sec commits. Or it might be that your servers are low on RAM, so they are paging during the commit phase.
If you can't reproduce it with a standalone host, then I think you'll have to run similar system monitoring on all the hosts in the cluster. That's harder, but for 10 hosts it is just barely manageable.
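If it helps, the standalone reproduction can be as small as a write-then-poll probe. The sketch below assumes a MarkLogic REST API instance on port 8000 with digest authentication; the host, credentials, and test URI are all placeholders:

```python
# Write a tiny test document, then poll until a read succeeds, reporting how
# long the document stayed invisible. Assumes a MarkLogic REST API instance on
# port 8000 with digest auth; host, credentials, and URI are placeholders.
import time
import urllib.error
import urllib.request

DOC_URL = "http://ml-host:8000/v1/documents?uri=/latency-probe.xml"
USER, PASSWORD = "admin", "admin"

passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passwords.add_password(None, DOC_URL, USER, PASSWORD)
opener = urllib.request.build_opener(urllib.request.HTTPDigestAuthHandler(passwords))

def write_probe():
    req = urllib.request.Request(
        DOC_URL, data=b"<probe/>", method="PUT",
        headers={"Content-Type": "application/xml"},
    )
    opener.open(req, timeout=10).read()

def seconds_until_readable():
    start = time.time()
    while True:
        try:
            opener.open(DOC_URL, timeout=10).read()   # GET the same URI
            return time.time() - start
        except urllib.error.HTTPError as err:
            if err.code != 404:                       # 404 = not visible yet
                raise
            time.sleep(0.05)

write_probe()
print(f"document became readable after {seconds_until_readable():.3f} s")
```

Running the same probe against the 10-node cluster and against a standalone host should show quickly whether the 3-second window is a cluster effect or something local to one or more hosts.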

running Hadoop software on office computers (when they are idle)

Is there a project which helps setup a Hadoop cluster on office desktops, when they are idle?
I'd like to experiment with Hadoop/MR/HBase but don't have access to 5-10 computers. The computers at work are idle after hours and are connected to each other through a very high speed connection. What's more, data on these computers stays within our network, so there is no privacy issue.
In order for this to work I need a fairly lightweight monitor running on each machine. When the computer has been idle for X hours, it will join the cluster. If the user logs on, it has to drop out of the cluster and release all CPU/memory.
Does something like this exist?
You can use Task Scheduler to detect the idle state and then start/stop a Hadoop VM with VirtualBox or VMware Player. Or you can write a PowerShell script that starts/stops it based on resource usage.
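As a rough illustration of that approach (not a hardened solution), the sketch below polls the time since the last keyboard/mouse input on Windows and starts or save-states a VirtualBox VM accordingly. The VM name and the idle threshold are placeholders, and the Task Scheduler/PowerShell route would be simpler to operate:

```python
# Rough sketch of "start the Hadoop VM when the desktop is idle" on Windows:
# poll the time since the last keyboard/mouse input and start or save-state a
# VirtualBox VM accordingly. VM name and thresholds are placeholders.
import ctypes
import subprocess
import time

VM_NAME = "hadoop-worker"            # placeholder VirtualBox VM name
IDLE_THRESHOLD_S = 2 * 60 * 60       # join the cluster after 2 hours idle

class LASTINPUTINFO(ctypes.Structure):
    _fields_ = [("cbSize", ctypes.c_uint), ("dwTime", ctypes.c_uint)]

def idle_seconds():
    """Seconds since the last user input, via the Win32 GetLastInputInfo API."""
    info = LASTINPUTINFO()
    info.cbSize = ctypes.sizeof(info)
    ctypes.windll.user32.GetLastInputInfo(ctypes.byref(info))
    return (ctypes.windll.kernel32.GetTickCount() - info.dwTime) / 1000.0

vm_running = False
while True:
    if idle_seconds() >= IDLE_THRESHOLD_S and not vm_running:
        subprocess.run(["VBoxManage", "startvm", VM_NAME, "--type", "headless"], check=True)
        vm_running = True
    elif idle_seconds() < IDLE_THRESHOLD_S and vm_running:
        # Save-state rather than power off, so the node can rejoin quickly later.
        subprocess.run(["VBoxManage", "controlvm", VM_NAME, "savestate"], check=True)
        vm_running = False
    time.sleep(60)
```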
Hadoop is not a computation grid; it is more of a data grid (see slide 9 in this presentation). The point is that with Hadoop the data is spread over the cluster and thus has to be stored on the computers. The time it would take to copy the data over/remove it when they're not idle would probably not be worth it - you'd be better off using Hadoop in the cloud (Amazon, Azure, etc.).
I would use something like Condor: http://research.cs.wisc.edu/condor/
You might want to take a look at Virginia Tech's Project Moon http://www.wired.com/wiredenterprise/2012/05/project_moon/
Look at solutions like NEREUS, which is a good MPC solution in Java.
