I am using elasticdump to dump data from my local machine to the server, but my dumps always end with this error:
...
Tue, 20 Oct 2015 22:56:35 GMT | sent 100 objects to destination elasticsearch, wrote 100
Tue, 20 Oct 2015 22:56:35 GMT | got 100 objects from source elasticsearch (offset: 21200)
Tue, 20 Oct 2015 22:56:36 GMT | Error Emitted => read ECONNRESET
Tue, 20 Oct 2015 22:56:36 GMT | Total Writes: 21200
Tue, 20 Oct 2015 22:56:36 GMT | dump ended with error (set phase) => Error: read ECONNRESET
...
How should I solve this problem?
Is there a better way to dump data from a local machine to the server? Thanks in advance!
It sounds like your issue is caused by elasticdump opening too many sockets to your Elasticsearch cluster. You can use the --maxSockets option to limit the number of sockets it opens.
For example:
$ elasticdump --input http://192.168.2.222:9200/index1 --output http://192.168.2.222:9200/index2 --type=data --maxSockets=5
You can find a detailed explanation of the issue here:
https://github.com/taskrabbit/elasticsearch-dump/issues/98
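If the resets persist even with the socket cap, lowering the batch size at the same time is another knob worth trying; --limit sets how many objects elasticdump moves per operation (100 by default). This is only a sketch, reusing the placeholder addresses and index names from the example above, with --limit=50 as an arbitrary illustrative value:
$ elasticdump --input=http://192.168.2.222:9200/index1 --output=http://192.168.2.222:9200/index2 --type=data --maxSockets=5 --limit=50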
Related
I'm using elasticdump and got a weird error:
Mon, 14 Nov 2022 14:42:21 GMT | starting dump
Mon, 14 Nov 2022 14:42:22 GMT | got 10 objects from source elasticsearch (offset: 0)
Mon, 14 Nov 2022 14:42:22 GMT | sent 10 objects to destination file, wrote 10
Mon, 14 Nov 2022 14:42:22 GMT | Error Emitted => This and all future requests should be directed to the given URI.
Mon, 14 Nov 2022 14:42:22 GMT | Error Emitted => This and all future requests should be directed to the given URI.
Mon, 14 Nov 2022 14:42:22 GMT | Total Writes: 0
Mon, 14 Nov 2022 14:42:22 GMT | dump ended with error (get phase) => MOVED_PERMANENTLY: This and all future requests should be directed to the given URI.
It successfully moved 10 objects and then stopped.
--input-index is for a different use case.
Try it with just --input, like this:
elasticdump --input=http://localhost/dev_index --output=test2.json
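If that still fails with MOVED_PERMANENTLY, the log above suggests the input URL itself answers with a 301 redirect (for example plain http on a host that forwards to https or to another port), which elasticdump reports as an error instead of following. A quick, non-authoritative check against the URL from the command above:
curl -sI http://localhost/dev_index
If the first line shows a 301, point --input at the URL given in the Location header instead.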
I am trying to create a file on a remote HDFS from a non-Hadoop environment.
For this purpose, I am using the pywebhdfs API, and I'm running the command with curl.
https://pythonhosted.org/pywebhdfs/
I used this documentation as a reference. I am able to execute all the other methods except create_file().
When using create_file(), I get an error like: 'Couldn't connect to host'
Command: curl -i -X PUT -L "http://xxx.xxx.xxx.xxx:50070/webhdfs/v1/test1/?op=CREATE" -T sample.txt
Response:
HTTP/1.1 100 Continue
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Tue, 30 Oct 2018 12:04:04 GMT
Date: Tue, 30 Oct 2018 12:04:04 GMT
Pragma: no-cache
Expires: Tue, 30 Oct 2018 12:04:04 GMT
Date: Tue, 30 Oct 2018 12:04:04 GMT
Pragma: no-cache
Content-Type: application/octet-stream
Location: http://localhost:50075/webhdfs/v1/test1/?op=CREATE&namenoderpcaddress=xxx.xxx.xxx.xxx:9000&overwrite=false
Content-Length: 0
Server: Jetty(6.1.26)
curl: (7) couldn't connect to host
The Location header here shows localhost. I took a reference from this past post:
webhdfs always redirect to localhost:50075
but I didn't have any success with it.
I tried changing the IP in hdfs-site.xml and in the /etc/hosts file, but with no success at all.
Can anyone tell me how to fix this?
Thanks in advance.
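For what it's worth, the WebHDFS documentation describes CREATE as an explicit two-step exchange, and running the two steps by hand can confirm whether the only real problem is the localhost in the advertised Location header. This is just a sketch reusing the addresses from the output above, where <datanode-host> is a placeholder for whatever address the datanode is actually reachable at:
Step 1 - ask the namenode without sending any data, and note the Location header in the 307 response:
curl -i -X PUT "http://xxx.xxx.xxx.xxx:50070/webhdfs/v1/test1/?op=CREATE"
Step 2 - send the file to that Location, with localhost replaced by the datanode's real address:
curl -i -X PUT -T sample.txt "http://<datanode-host>:50075/webhdfs/v1/test1/?op=CREATE&namenoderpcaddress=xxx.xxx.xxx.xxx:9000&overwrite=false"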
I am trying to import an index from a JSON file to an Elasticsearch server, but it is failing.
Specifications:
elasticsearch : 4.10.3
elasticdump : 2.4.2
The command I am using:
elasticdump --input=/home/ubuntu/Files/stocks.json --output=http://localhost:9200/ --type=data
My stocks.json file looks like this:
{"_index":"stocks","_type":"stock","_id":"AVhKm5L8FPDye23IuJqe","_score":1,"_source":{"name":"Sun Pharmaceutical Industries Ltd.","industry":"PHARMA","isin":"INE044A01036","symbol":"SUNPHARMA","tweet":"sun pharma' OR 'SUNPHARMA'"}}
{"_index":"stocks","_type":"stock","_id":"AVhKm5L8FPDye23IuJqV","_score":1,"_source":{"name":"Tata Steel Ltd.","industry":"METALS","isin":"INE081A01012","symbol":"TATASTEEL","tweet":"tata steel' OR 'TATASTEEL'"}}
{"_index":"stocks","_type":"stock","_id":"AVhKm5L7FPDye23IuJp2","_score":1,"_source":{"name":"ICICI Bank Ltd.","industry":"FINANCIAL SERVICES","isin":"INE090A01021","symbol":"ICICIBANK","tweet":"icici bank' OR 'ICICIBANK'"}}
I am getting the following message:
Sat, 07 Oct 2017 05:46:52 GMT | starting dump
Sat, 07 Oct 2017 05:46:52 GMT | got 100 objects from source file (offset: 0)
Sat, 07 Oct 2017 05:46:52 GMT | sent 100 objects to destination elasticsearch, wrote 0
Sat, 07 Oct 2017 05:46:52 GMT | got 0 objects from source file (offset: 100)
Sat, 07 Oct 2017 05:46:52 GMT | Total Writes: 0
Sat, 07 Oct 2017 05:46:52 GMT | dump complete
I had used the same JSON file before, but somehow it is not working on this new server. I installed Elasticsearch and Node recently on this server.
Thanks for the help
J
I use an Apache Storm topology on a cluster of 8+1 machines. The clocks on these machines are not synchronized, and they can differ by more than 5 minutes.
preprod-storm-nimbus-01:
Thu Feb 25 16:20:30 GMT 2016
preprod-storm-supervisor-01:
Thu Feb 25 16:20:32 GMT 2016
preprod-storm-supervisor-02:
Thu Feb 25 16:20:32 GMT 2016
preprod-storm-supervisor-03:
Thu Feb 25 16:14:54 UTC 2016 <<-- this machine is very late :(
preprod-storm-supervisor-04:
Thu Feb 25 16:20:31 GMT 2016
preprod-storm-supervisor-05:
Thu Feb 25 16:20:17 GMT 2016
preprod-storm-supervisor-06:
Thu Feb 25 16:20:00 GMT 2016
preprod-storm-supervisor-07:
Thu Feb 25 16:20:31 GMT 2016
preprod-storm-supervisor-08:
Thu Feb 25 16:19:55 GMT 2016
preprod-storm-supervisor-09:
Thu Feb 25 16:20:30 GMT 2016
Question:
Is the Storm topology affected by this lack of synchronization?
Note: I know that synchronizing is better, but the sysadmins won't do it unless I give them proof/reasons that they have to. Do they really have to do it, "for the topology's sake" :)?
Thanks
It depends on the computation you are doing... It might affect your results if you do time-based window operations. Otherwise, it doesn't matter.
For Storm as an execution engine, it has no effect at all.
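As a side note, if the sysadmins need evidence of how far apart the clocks actually are, a quick comparison run from one machine makes the drift visible. This is only a minimal sketch, assuming passwordless SSH to each host (host names taken from the listing above); it prints each machine's UTC epoch seconds next to the local clock:
for h in preprod-storm-nimbus-01 preprod-storm-supervisor-0{1..9}; do
  # remote clock vs. the clock on the machine running the loop
  printf '%-28s remote=%s local=%s\n' "$h" "$(ssh "$h" date -u +%s)" "$(date -u +%s)"
done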
I created a simple file in HDFS at the path /user/admin/foo.txt
I can see the contents of this file in Hue.
Now I issue the command
curl -i http://namenode:50070/webhdfs/v1/user/admin/foo.txt?op=OPEN
I get the response
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Tue, 24 Nov 2015 16:20:15 GMT
Date: Tue, 24 Nov 2015 16:20:15 GMT
Pragma: no-cache
Expires: Tue, 24 Nov 2015 16:20:15 GMT
Date: Tue, 24 Nov 2015 16:20:15 GMT
Pragma: no-cache
Location: http://datanode:50075/webhdfs/v1/user/admin/foo.txt?op=OPEN&namenoderpcaddress=nameservice1&offset=0
Content-Type: application/octet-stream
Content-Length: 0
Server: Jetty(6.1.26.cloudera.4)
Why is the Content-Length 0? I was hoping this would return the contents of the file.
Execute:
curl -i "http://datanode:50075/webhdfs/v1/user/admin/foo.txt?op=OPEN&namenoderpcaddress=nameservice1&offset=0"
(quote the URL so the shell does not treat the & characters as background operators)
As for the explanation: when using WebHDFS to open a file, you have to do the following:
You don't know which node the file resides on, so you ask the namenode.
The namenode returns a datanode that holds the file.
You then open the file by talking directly to that datanode.
So this activity is expected. See https://hadoop.apache.org/docs/r1.0.4/webhdfs.html for more information.
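Alternatively, curl can follow the redirect for you with -L, which collapses the two steps into a single command (same URL as in the question):
curl -i -L "http://namenode:50070/webhdfs/v1/user/admin/foo.txt?op=OPEN"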