Hadoop container fails even though 100 percent completed - hadoop

I have set up a small cluster with Hadoop 2.7, HBase 0.98 and Nutch 2.3.1. I have written a custom job that first combines the docs of the same domain; after that, each URL of the domain is taken from a cache (i.e., a list), its key is used to fetch the corresponding object via datastore.get(url_key), and then, after updating its score, the object is written via context.write.
The job should complete after all docs are processed, but what I have observed is that each attempt fails due to a timeout even though its progress shows 100 percent complete. Here is the log:
attempt_1549963404554_0110_r_000001_1 100.00 FAILED reduce > reduce node2:8042 logs Thu Feb 21 20:50:43 +0500 2019 Fri Feb 22 02:11:44 +0500 2019 5hrs, 21mins, 0sec AttemptID:attempt_1549963404554_0110_r_000001_1 Timed out after 1800 secs Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143
attempt_1549963404554_0110_r_000001_3 100.00 FAILED reduce > reduce node1:8042 logs Fri Feb 22 04:39:08 +0500 2019 Fri Feb 22 07:25:44 +0500 2019 2hrs, 46mins, 35sec AttemptID:attempt_1549963404554_0110_r_000001_3 Timed out after 1800 secs Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143
attempt_1549963404554_0110_r_000002_0 100.00 FAILED reduce > reduce node3:8042 logs Thu Feb 21 12:38:45 +0500 2019 Thu Feb 21 22:50:13 +0500 2019 10hrs, 11mins, 28sec AttemptID:attempt_1549963404554_0110_r_000002_0 Timed out after 1800 secs Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143
Why is this, i.e., when an attempt is 100.00 percent complete, shouldn't it be marked as successful? Unfortunately, there is no error information other than the timeout in my case. How can I debug this problem?
My reducer is essentially the one posted in another question:
Apache Nutch 2.3.1 map-reduce timeout occurred while updating the score

I have observed that, in the three attempts logged above, the execution times vary widely. Please take another look at the job you are executing.
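One thing worth checking (not from the original answer): the attempt is killed because it produces no progress for mapreduce.task.timeout (1800 seconds here), which typically happens when a single reduce() call runs for a long time without reporting anything. Below is a minimal sketch of keeping the attempt alive by reporting progress from inside the loop; the class name and types are placeholders, not the actual Nutch job:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Placeholder reducer: report progress while doing long per-key work so the
// ApplicationMaster does not kill the attempt after mapreduce.task.timeout.
public class DomainScoreReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text domain, Iterable<Text> urls, Context context)
            throws IOException, InterruptedException {
        for (Text url : urls) {
            // ... fetch the stored object for this URL and update its score ...
            context.write(domain, url);
            context.progress();                       // tell the framework the task is still alive
            context.setStatus("scoring " + domain);   // optional human-readable status
        }
    }
}
Alternatively, mapreduce.task.timeout can be raised in the job configuration, but reporting progress from long-running loops is usually the cleaner fix.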

Related

Analysis of Redis error logs "LOADING Redis is loading the dataset in memory" and more

I am frequently seeing these messages in the Redis logs:
1#
602854:M 23 Dec 2022 09:48:54.028 * 10 changes in 300 seconds. Saving...
602854:M 23 Dec 2022 09:48:54.035 * Background saving started by pid 3266364
3266364:C 23 Dec 2022 09:48:55.844 * DB saved on disk
3266364:C 23 Dec 2022 09:48:55.852 * RDB: 12 MB of memory used by copy-on-write
602854:M 23 Dec 2022 09:48:55.938 * Background saving terminated with success
2#
LOADING Redis is loading the dataset in memory
3#
7678:signal-handler (1671738516) Received SIGTERM scheduling shutdown...
7678:M 22 Dec 2022 23:48:36.300 # User requested shutdown...
7678:M 22 Dec 2022 23:48:36.300 # systemd supervision requested, but NOTIFY_SOCKET not found
7678:M 22 Dec 2022 23:48:36.300 * Saving the final RDB snapshot before exiting.
7678:M 22 Dec 2022 23:48:36.300 # systemd supervision requested, but NOTIFY_SOCKET not found
7678:M 22 Dec 2022 23:48:36.720 * DB saved on disk
7678:M 22 Dec 2022 23:48:36.720 * Removing the pid file.
7678:M 22 Dec 2022 23:48:36.720 # Redis is now ready to exit, bye bye...
7901:C 22 Dec 2022 23:48:37.071 # WARNING supervised by systemd - you MUST set appropriate values for TimeoutStartSec and TimeoutStopSec in your service unit.
7901:C 22 Dec 2022 23:48:37.071 # systemd supervision requested, but NOTIFY_SOCKET not found
7914:C 22 Dec 2022 23:48:37.071 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
7914:C 22 Dec 2022 23:48:37.071 # Redis version=6.0.9, bits=64, commit=00000000, modified=0, pid=7914, just started
7914:C 22 Dec 2022 23:48:37.071 # Configuration loaded
Are these messages concerning?
Let me know if there's any optimization to be carried out in terms of settings.
The first set of informational messages is related to Redis persistence: it appears your Redis node is configured to save the database to disk once 300 seconds have elapsed and at least 10 write operations have been performed against it. You can change that according to your needs through the Redis configuration file.
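For reference, that behaviour corresponds to a save rule along the following lines in redis.conf (the exact thresholds on your node may differ):
# Snapshot the dataset to disk if at least 10 keys changed within 300 seconds.
save 300 10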
The message LOADING Redis is loading the dataset in memory, on the other hand, is an error returned while attempting to connect to a Redis instance that is still loading its dataset into memory: that occurs during startup for standalone servers and master nodes, or when replicas reconnect and fully resynchronize with the master. If you are seeing this error too often, and not right after a system restart, I would suggest checking your system log files to find out why your Redis instance is restarting or resynchronizing (depending on your topology).

Compute Engine start node server on instance startup

I am trying to run a Discord bot as a Node application on a free Compute Engine instance. I am struggling to write a script that actually starts the node app.
I created this script and added it as startup-script metadata from file:
cd code/movo-tron-2000 && npm start &
I checked that the script runs with sudo google_metadata_script_runner --script-type startup --debug, but when I restart the instance, the app doesn't start. Running sudo journalctl -u google-startup-scripts.service prints the following logs:
Apr 20 12:19:08 bot-vm systemd[1]: Starting Google Compute Engine Startup Scripts...
Apr 20 12:19:09 bot-vm startup-script[691]: INFO Starting startup scripts.
Apr 20 12:19:09 bot-vm startup-script[691]: INFO Found startup-script in metadata.
Apr 20 12:19:09 bot-vm startup-script[691]: INFO startup-script: /startup-od52epug/tmpjy_z4vue: line 1: cd: code/mo
Apr 20 12:19:09 bot-vm startup-script[691]: INFO startup-script: Return code 0.
Apr 20 12:19:09 bot-vm startup-script[691]: INFO Finished running startup scripts.
Apr 20 12:19:09 bot-vm systemd[1]: Started Google Compute Engine Startup Scripts.
I see that the script gets executed, but it also gets terminated. The app listens for requests, so it should not be terminated if it is to keep running. I assume the startup script runs in the same process as the Compute Engine startup-script machinery, so it gets terminated so that the VM boot can continue. What should I change in my startup script so that my app starts properly and is not terminated by the instance?
Edit: I set up the following systemd service and script at their corresponding locations
Service:
[Unit]
Description=Start bot
[Service]
ExecStart=/home/me_adi_hf/code/movo-tron-2000/start.sh
[Install]
WantedBy=default.target
Script:
#!/bin/sh
date > /root/bot_report.txt
du -sh /home/ >> /root/bot_report.txt
But when running sudo systemctl start bot.service and then checking its status with sudo systemctl status bot.service, I am getting this output, indicating an Exec format error:
bot-start.service - Start bot
Loaded: loaded (/etc/systemd/system/bot-start.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2020-04-21 09:43:01 UTC; 9s ago
Process: 19303 ExecStart=/home/me_adi_hf/code/movo-tron-2000/start.sh (code=exited, status=203/EXEC)
Main PID: 19303 (code=exited, status=203/EXEC)
Apr 21 09:43:01 bot-vm systemd[1]: Started Start bot.
Apr 21 09:43:01 bot-vm systemd[19303]: bot-start.service: Failed at step EXEC spawning /home/me_adi_hf/code/movo-tron-2000/start.sh: Exec format error
Apr 21 09:43:01 bot-vm systemd[1]: bot-start.service: Main process exited, code=exited, status=203/EXEC
Apr 21 09:43:01 bot-vm systemd[1]: bot-start.service: Unit entered failed state.
Apr 21 09:43:01 bot-vm systemd[1]: bot-start.service: Failed with result 'exit-code'.
I am not sure what causes the error, since the service file syntax looks correct.
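For what it's worth, status=203/EXEC from systemd means the service manager could not execute the file named in ExecStart; common causes are a missing execute bit, a missing or CRLF-damaged shebang line, or a wrong path. A minimal check, assuming the paths from the question:
# Confirm the script is executable and its first line is a valid shebang.
chmod +x /home/me_adi_hf/code/movo-tron-2000/start.sh
head -n 1 /home/me_adi_hf/code/movo-tron-2000/start.sh   # expect: #!/bin/sh
sudo systemctl daemon-reload
sudo systemctl restart bot-start.service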

Heroku restarting with SIGTERM status 143

I have a scraper running on Heroku. It has been running for a while (about 2 months), and it has days when it does great and reaches its maximum of 1,000 and days when it just magically restarts.
Does anyone know what the reason for such a restart could be? The scraper shows no errors; the only thing I can find is the message below in the Heroku logs:
Feb 05 03:02:55 scraper heroku/web.1: Cycling
Feb 05 03:02:55 scraper heroku/web.1: State changed from up to starting
Feb 05 03:02:57 scraper heroku/web.1: Stopping all processes with SIGTERM
Feb 05 03:02:57 scraper heroku/web.1: Process exited with status 143
Feb 05 03:03:16 scraper heroku/web.1: Starting process with command `npm start`
The Cycling bit of the log here is the interesting one.
Heroku will restart dynos every 24h, and this process is called "cycling". This is what you're seeing here.

Getting Netty client related error in storm topology and worker restarting

Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
I have a Storm topology with 3 bolts (A, B, C), where the middle bolt takes around 450 ms mean time and the other two bolts take less than 1 ms.
I am running the topology with the following parallelism hint values on two machines (a declaration sketch follows the list):
A: 4
B: 700
C: 10
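For clarity, a parallelism hint is the number of executors Storm starts for a component. A minimal sketch of how hints like the above are typically declared; the spout and bolt class names are placeholders, not code from the question:
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
// Placeholder spout/bolt classes stand in for the question's components.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", new MyKafkaSpout(), 1);
builder.setBolt("A", new BoltA(), 4).shuffleGrouping("kafka-spout");
builder.setBolt("B", new BoltB(), 700).shuffleGrouping("A");
builder.setBolt("C", new BoltC(), 10).shuffleGrouping("B");
Config conf = new Config();
conf.setNumWorkers(2);   // e.g. one worker per machine
conf.setDebug(true);
StormSubmitter.submitTopology("myTopo", conf, builder.createTopology());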
I am getting the following error a few minutes after the topology starts:
in worker log:
2018-07-04T20:16:28.835+05:30 Client [ERROR] discarding 7 messages because the Netty client to Netty-Client-/ip:6700 is being closed
in supervisor logs:
2018-07-04 20:16:29.468 o.a.s.d.s.BasicContainer [INFO] Worker Process 32bc11c0-a1d0-4593-a91a-3ff788ea041a exited with code: 20
2018-07-04 20:16:31.592 o.a.s.d.s.Slot [WARN] SLOT 6700: main process has exited
2018-07-04 20:16:31.592 o.a.s.d.s.Container [INFO] Killing 2825cbe9-aedd-4f10-a796-4f9dc30ae72f:32bc11c0-a1d0-4593-a91a-3ff788ea041a
2018-07-04 20:16:31.600 o.a.s.u.Utils [INFO] Error when trying to kill 7422. Process is probably already dead.
2018-07-04 20:16:32.600 o.a.s.d.s.Slot [INFO] STATE RUNNING msInState: 391195 topo:myTopo-1-1530715184 worker:32bc11c0-a1d0-4593-a91a-3ff788ea041a -> KILL_AND_RELAUNCH msInState: 0 topo:myTopo-1-1530715184 worker:32bc11c0-a1d0-4593-a91a-3ff788ea041a
2018-07-04 20:16:32.600 o.a.s.d.s.Container [INFO] GET worker-user for 32bc11c0-a1d0-4593-a91a-3ff788ea041a
I have seen similar questions asked here and here, and I have a few queries related to this:
Why is this error occurring and how can I resolve it?
How can I get more debug information from Storm? I have already set conf.setDebug(true).
Are there any limitations/guidelines around how much parallelism is OK for a bolt on n machines?
Edit:
Logs for strace -fp PID -e trace=read,write,network,signal,ipc are in a gist. The relevant-looking part is from when the above happens; however, I see such SIGSEGVs in many places in the strace output:
[pid 23635] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f83af6f1180} ---
[pid 23549] <... read resumed> "PK\3\4\n\0\0\0\10\0\364J\336F\222'\202\312\310\2\0\0\16\5\0\0\36\0\0\0", 30) = 30
[pid 23654] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f83af6f1f80} ---
[pid 23549] read(23, "\235TmW\22A\24~\6\224\227u\vE4\255,JR\300WP\322\0245TH\23\313\3j\347"..., 712) = 712
[pid 23654] rt_sigreturn({mask=[QUIT]}) = 140203560738688
[pid 23635] rt_sigreturn({mask=[QUIT]}) = 140203560735104
The strace output of the worker process is here; the relevant-looking logs are:
[pid 24435] recvfrom(291, "HTTP/1.1 200 OK\r\nContent-Type: a"..., 8192, 0, NULL, NULL) = 544
[pid 23473] write(3, "Heap\n garbage-first heap total"..., 347) = 347
[pid 24434] +++ exited with 20 +++
[pid 24405] +++ exited with 20 +++
[pid 24435] +++ exited with 20 +++
[pid 24427] +++ exited with 20 +++
Edit 2:
There is this question as well: Connection refused error in worker logs - apache storm: as per its answer, not setting storm.local.hostname might cause it, but it is already set for me.
There is another bug filed here with a similar Netty error, which is also still unresolved.

macOS Server 5.3 Calendar pg_ctl not starting

After updating macOS Server to 5.3 (running on macOS 10.12.4), my Calendar & Contacts have stopped syncing.
It seems that it's having trouble starting Postgres for the cluster /Library/Server/Calendar and Contacts/Data/Database.xpg/cluster.pg, and possibly trouble with the agent too.
The GUI seems to think that the Calendar & Contacts services have started and are available, but when I run $ sudo serveradmin fullstatus calendar from the command line I get:
calendar:setStateVersion = 1
calendar:readWriteSettingsVersion = 1
calendar:state = "STARTING"
calendar:contactsState = "STARTING"
calendar:calendarState = "STARTING"
System log is being spammed with:
Apr 22 11:58:42 com.apple.xpc.launchd[1] (org.calendarserver.agent[44649]): Service exited with abnormal code: 1
Apr 22 11:58:42 com.apple.xpc.launchd[1] (org.calendarserver.agent): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.
Apr 22 11:58:52 com.apple.xpc.launchd[1] (org.calendarserver.agent[44659]): Service exited with abnormal code: 1
Apr 22 11:58:52 com.apple.xpc.launchd[1] (org.calendarserver.agent): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.
Apr 22 11:59:02 com.apple.xpc.launchd[1] (org.calendarserver.agent[44668]): Service exited with abnormal code: 1
Apr 22 11:59:02 com.apple.xpc.launchd[1] (org.calendarserver.agent): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.
Apr 22 11:59:07 com.apple.xpc.launchd[1] (org.calendarserver.calendarserver[44676]): Service exited with abnormal code: 1
Apr 22 11:59:07 com.apple.xpc.launchd[1] (org.calendarserver.calendarserver): Service only ran for 0 seconds. Pushing respawn out by 60 seconds.
Here's the output of $ sudo /Applications/Server.app/Contents/ServerRoot/usr/sbin/calendarserver_diagnose
Any ideas?
OS Build: 16E195
Server Build: 16S4123
/Library/Server/Preferences/Calendar.plist exists and can be parsed
Prefs plist says ServerRoot directory is: /Library/Server/Calendar and Contacts
ServerRoot volume ok
/Library/Server/Calendar and Contacts/Config/caldavd-system.plist exists and can be parsed
/Library/Server/Calendar and Contacts/Config/caldavd-user.plist does not exist
Configuration:
Calendar and Contacts service processes:
USER PID %CPU %MEM RSS ELAPSED STARTED COMMAND
root 42554 0.0 0.1 11072 07:49 Sat 22 Apr 11:32:16 2017 servermgr_calendar
Serverd status:
org.calendarserver.agent is enabled
org.calendarserver.calendarserver is enabled
org.calendarserver.relocate is enabled
Disk space on boot volume:
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 999G 777G 222G 78% 8520180 4286447099 0% /
Disk space on service data volume:
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 999G 777G 222G 78% 8520180 4286447099 0% /
Disk space used by Calendar and Contacts service:
20K /Library/Server/Calendar and Contacts/Config
1014M /Library/Server/Calendar and Contacts/Data
200M /Library/Server/Calendar and Contacts/Logs
Postgres status for cluster /Library/Server/Calendar and Contacts/Data/Database.xpg/cluster.pg:
pg_ctl: no server running
Agent:
Attempting to send a request to the agent...
Can't connect to agent: timed out
Server connection:
Traceback (most recent call last):
File "/Applications/Server.app/Contents/ServerRoot/usr/sbin/calendarserver_diagnose", line 14, in <module>
load_entry_point('CalendarServer==9.1a1.dev0+56b4197875debefef19d9c19840f903a8e480c88.head', 'console_scripts', 'calendarserver_diagnose')()
File "/Applications/Server.app/Contents/ServerRoot/Library/CalendarServer/lib/python2.7/site-packages/calendarserver/tools/diagnose.py", line 145, in main
connectToCaldavd(keys)
File "/Applications/Server.app/Contents/ServerRoot/Library/CalendarServer/lib/python2.7/site-packages/calendarserver/tools/diagnose.py", line 584, in connectToCaldavd
url = "https://{host}/principals/".format(host=keys["ServerHostName"])
KeyError: 'ServerHostName'
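Since the diagnose output only reports "pg_ctl: no server running", running pg_ctl by hand against the same cluster can surface the actual startup error. A rough sketch, assuming Server.app's pg_ctl is on the PATH (its exact location and the service user are install-specific assumptions):
# Ask Postgres directly why the calendar cluster is down (data directory from the diagnose output).
pg_ctl -D "/Library/Server/Calendar and Contacts/Data/Database.xpg/cluster.pg" status
pg_ctl -D "/Library/Server/Calendar and Contacts/Data/Database.xpg/cluster.pg" start
# Then look through the service's log directory for the failure reason
# (the exact log file name is install-specific).
ls "/Library/Server/Calendar and Contacts/Logs/"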
