HDFS space is increasing drastically while updating documents in Solr - hadoop

We are using Solr with HDFS for our indexing needs. While updating existing documents (read the existing doc and update it) during our performance run, we observed that HDFS storage space was growing exponentially. We are using the standard settings described here: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS. Any clues on what could be the root cause of our issue? Thanks for your help.

We have been testing different configuration values to solve this issue. So far it appears that enabling solr.hdfs.blockcache.direct.memory.allocation=true in the solrconfig.xml file resolves it.
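For reference, here is roughly what the relevant directoryFactory block in solrconfig.xml looks like with that flag enabled (a sketch based on the standard settings from the cwiki page above; the HDFS paths and cache sizes are placeholders you would adjust for your own cluster):

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <!-- placeholder HDFS locations; point these at your own cluster -->
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <int name="solr.hdfs.blockcache.slab.count">1</int>
      <!-- allocate the block cache off-heap; this is the setting that stopped the space growth for us -->
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
      <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
    </directoryFactory>

Remember to give the JVM enough direct memory (-XX:MaxDirectMemorySize) when the block cache is allocated off-heap.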

Related

Couchbase Replication Error to Elasticsearch

I have an existing replication from Couchbase -> Elasticsearch. I found out that there are now errors in the replication:
I tried to create the replication again, but it gave the same error:
I already checked my Elasticsearch head plugin and I can see data in there and can query it with results. I also restarted my Elasticsearch batch file, but the error persists.
Can anyone help me with what else I need to check to investigate the issue further? Thank you in advance.
You may have a connectivity problem, which can happen due to networking issues like an IP address change since you initially set up the replication.
You might try the troubleshooting steps outlined here if you haven't already:
http://developer.couchbase.com/documentation/server/4.1/connectors/elasticsearch-2.1/trouble-intro.html
You should also check the goxdcr logs, which you can find here depending on the OS you're using:
http://developer.couchbase.com/documentation/server/4.0/troubleshooting/troubleshooting-logs.html
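As a first sanity check (a rough sketch; the port assumes the transport plugin's default of 9091 and the log path assumes a Linux install, so adjust both to your setup), verify that the plugin endpoint is reachable from the Couchbase node and then watch the XDCR log while the replication retries:

    # check that the Couchbase transport plugin on the ES node responds
    # (9091 is the plugin's default port; replace user/password with your own)
    curl -u <user>:<password> http://<es-host>:9091/

    # follow the goxdcr log on the Couchbase node for rejected requests
    tail -f /opt/couchbase/var/lib/couchbase/logs/goxdcr.log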

dncp_block_verification log file increases in size in HDFS

We are using Cloudera CDH 5.3. I am facing a problem wherein the size of "/dfs/dn/current/Bp-12345-IpAddress-123456789/dncp-block-verification.log.curr" and "dncp-block-verification.log.prev" keeps growing to TBs within hours. I read in some blogs that this is an HDFS bug. A temporary solution to the problem is to stop the datanode services and delete these files. But we have observed that the log file grows again on either of the datanodes (even on the same node after deleting it), so it requires continuous monitoring.
Does anyone have a permanent solution to this problem?
One solution, although slightly drastic, is to disable the block scanner entirely by setting the key dfs.datanode.scan.period.hours to 0 in the HDFS DataNode configuration (the default is 504 hours). The negative effect of this is that your DNs may not auto-detect corrupted block files (and would have to wait for a future client reading the block to detect them instead); this isn't a big deal if your average replication factor is around 3, but you can treat the change as a short-term one until you upgrade to a release that fixes the issue.
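For reference, a sketch of the corresponding hdfs-site.xml override for the DataNodes (on a Cloudera Manager managed cluster this usually goes into the DataNode safety valve; on a plain install, directly into hdfs-site.xml, followed by a DataNode restart):

    <!-- disable the DataNode block scanner (default is 504 hours) -->
    <property>
      <name>dfs.datanode.scan.period.hours</name>
      <value>0</value>
    </property>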
Note that this problem will not occur if you upgrade to the latest CDH 5.4.x or a higher release, which includes the HDFS-7430 rewrite and its associated bug fixes. Those changes did away with the use of such a local file, thereby removing the problem.

How to recover data from a renamed Elasticsearch cluster?

I have just spent the best part of 12 hours indexing 70 million documents into Elasticsearch (1.4) on a single-node, single-server setup on an EC2 Ubuntu 14.04 box. This completed successfully; however, before taking a snapshot of my server I thought it would be wise to rename the cluster to prevent it from accidentally joining production boxes in the future. What a mistake that was! After renaming it in the elasticsearch.yml file and restarting the ES service, my indexes have disappeared.
I saw the data was still present in the data dir under the old cluster name, so I tried stopping ES, moving the data manually in the filesystem and then starting the ES service again, but still no luck. I then tried renaming back to the old cluster name, putting everything back in place, and still nothing. The data is still there, all 44 GB of it, but I have no idea how to get it back. I have spent the past 2 hours searching and all I can seem to find is advice on how to restore from a snapshot, which I don't have. Any advice would be hugely appreciated - I really hope I haven't lost a day's work. I will never rename a cluster again!
Thanks in advance.
I finally fixed this on my own: I stopped the cluster, deleted the nodes directory that had been created under the new cluster name, copied my old nodes directory over being sure to respect the old structure exactly, chowned the folder to elasticsearch just in case, started up the cluster and breathed a huge sigh of relief to see 72 million documents!
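For anyone who hits the same thing, a rough sketch of those steps (the paths assume a Debian/Ubuntu package install with the default /var/lib/elasticsearch data directory, and the cluster names are placeholders):

    sudo service elasticsearch stop
    # remove the empty nodes dir created under the new cluster name
    sudo rm -rf /var/lib/elasticsearch/<new_cluster_name>/nodes
    # copy the old nodes directory over, preserving the original layout exactly
    sudo cp -a /var/lib/elasticsearch/<old_cluster_name>/nodes /var/lib/elasticsearch/<new_cluster_name>/
    # make sure the elasticsearch user owns everything, just in case
    sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch/<new_cluster_name>
    sudo service elasticsearch start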

Life of distributed cache in Hadoop

When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after a job is completed? If they are deleted, which I presume they are, is there a way to make the cache persist across multiple jobs? Does this work the same way on Amazon's Elastic MapReduce?
I was digging around in the source code, and it looks like files are deleted by TrackerDistributedCacheManager about once a minute when their reference count drops to zero. The TaskRunner explicitly releases all its files at the end of a task. Maybe you should edit TaskRunner to not do this, and control the cache through more explicit means yourself?
I cross posted this question at the AWS forum and got a good recommendation to use hadoop fs -get to transfer files in a way that persists across jobs.
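For example (a sketch; the bucket, file name and local destination are placeholders), a bootstrap action or setup step on each node can pull a shared file once, and every subsequent job can read it from local disk, since nothing cleans it up when a job finishes:

    # copy a shared lookup file from S3 (or HDFS) onto the local disk of each node
    hadoop fs -get s3n://my-bucket/shared/lookup.dat /mnt/shared/lookup.dat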

What is the best component stack for building a distributed log aggregator (like Splunk)?

I'm trying to find the best components I could use to build something similar to Splunk in order to aggregate logs from a large number of servers in a computing grid. It also needs to be distributed, because I have gigs of logs every day and no single machine will be able to store them all.
I'm particularly interested in something that will work with Ruby and will run on Windows and the latest Solaris (yeah, I got a zoo).
I see the architecture as:
Log crawler (Ruby script).
Distributed log storage.
Distributed search engine.
Lightweight front end.
The log crawler and the distributed search engine are already settled: logs will be parsed by a Ruby script and Elasticsearch will be used to index log messages. The front end is also very easy to choose - Sinatra.
My main problem is distributed log storage. I looked at MongoDB, CouchDB, HDFS, Cassandra and HBase.
MongoDB was rejected because it doesn't work on Solaris.
CouchDB doesn't support sharding (smartproxy is required to make it work, but that is something I don't even want to try).
Cassandra works great, but it's a disk space hog and it requires running autobalance every day to spread the load between Cassandra nodes.
HDFS looked promising, but the FileSystem API is Java-only and JRuby was a pain.
HBase looked like the best solution around, but deploying and monitoring it is just a disaster - in order to start HBase I need to start HDFS first, check that it started without problems, then start HBase and check it as well, and then start the REST service and check that too.
So I'm stuck. Something tells me HDFS or HBase is the best thing to use as log storage, but HDFS only works smoothly with Java and HBase is just a deployment/monitoring nightmare.
Can anyone share their thoughts or experience building similar systems using the components I described above, or with something completely different?
I'd recommend using Flume to aggregate your data into HBase. You could also use the Elastic Search Sink for Flume to keep a search index up to date in real time.
For more, see my answer to a similar question on Quora.
With regard to Java and HDFS: using a tool like BeanShell, you can drive the HDFS FileSystem API with Java-syntax scripts instead of writing and compiling full Java programs.
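On the Java point: if you can tolerate a small JVM shim next to your Ruby crawler, writing a log file into HDFS through the FileSystem API is only a few lines. A minimal sketch (the namenode URI and target path are placeholders):

    import java.io.OutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLogWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // placeholder namenode URI; point it at your cluster
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            Path file = new Path("/logs/grid/host1/app.log");
            // create() overwrites an existing file; use append() where the cluster supports it
            OutputStream out = fs.create(file);
            try {
                out.write("parsed log line goes here\n".getBytes("UTF-8"));
            } finally {
                out.close();
                fs.close();
            }
        }
    }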
