My application sends data to Kafka, using Kerberos for authentication. Everything works fine for around 20 days, and then I get the following exception:
2020-01-07 22:22:08.481 DEBUG 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.clients.NetworkClient : Initiating connection to node mkav2.dc.ex.com:9092 (id: 101 rack: null)
2020-01-07 22:22:08.481 DEBUG 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.common.security.authenticator.SaslClientAuthenticator : Set SASL client state to SEND_HANDSHAKE_REQUEST
2020-01-07 22:22:08.481 DEBUG 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.common.security.authenticator.SaslClientAuthenticator : Creating SaslClient: client=lpa/appX.dc.ex.com#DC.EX.COM;service=kafka;serviceHostname=mkav2.dc.ex.com;mechs=[GSSAPI]
2020-01-07 22:22:08.482 DEBUG 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.common.network.Selector : Created socket with SO_RCVBUF = 32768, SO_SNDBUF = 131072, SO_TIMEOUT = 0 to node 101
2020-01-07 22:22:08.482 DEBUG 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.common.security.authenticator.SaslClientAuthenticator : Set SASL client state to RECEIVE_HANDSHAKE_RESPONSE
2020-01-07 22:22:08.482 DEBUG 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.clients.NetworkClient : Completed connection to node 101. Fetching API versions.
2020-01-07 22:22:08.484 DEBUG 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.common.security.authenticator.SaslClientAuthenticator : Set SASL client state to INITIAL
2020-01-07 22:22:08.484 DEBUG 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.common.network.Selector : Connection with mkav2.dc.ex.com/172.10.15.44 disconnected
javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]) occurred when evaluating SASL token received from the Kafka Broker. Kafka Client will go to AUTH_FAILED state.
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.createSaslToken(SaslClientAuthenticator.java:298)
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.sendSaslToken(SaslClientAuthenticator.java:215)
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.authenticate(SaslClientAuthenticator.java:183)
at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:76)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:376)
at org.apache.kafka.common.network.Selector.poll(Selector.java:326)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:433)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:224)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.security.sasl.SaslException: GSS initiate failed
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator$2.run(SaslClientAuthenticator.java:280)
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator$2.run(SaslClientAuthenticator.java:278)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.createSaslToken(SaslClientAuthenticator.java:278)
... 9 common frames omitted
Caused by: org.ietf.jgss.GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
... 14 common frames omitted
2020-01-07 22:22:08.484 DEBUG 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.clients.NetworkClient : Node 101 disconnected.
2020-01-07 22:22:08.484 WARN 24987 --- [fka-producer-network-thread | producer-1] org.apache.kafka.clients.NetworkClient : Connection to node 101 terminated during authentication. This may indicate that authentication failed due to invalid credentials.
After restarting the application, everything works fine for another 20 days or so, and then I get the same exception again. These are the ticket properties in the krb5.conf file (values in seconds, i.e. a 24-hour ticket lifetime and a 7-day renew lifetime):
ticket_lifetime = 86400
renew_lifetime = 604800
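For context, the client authenticates over SASL/GSSAPI through a JAAS login entry of roughly this shape (a generic keytab-based sketch, not our exact file; the keytab path and principal below are placeholders):
KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="/etc/security/keytabs/appX.keytab"
    principal="lpa/appX.dc.ex.com@DC.EX.COM";
};
When the failure occurs, klist on the application host is a quick way to check whether the ticket cache still holds a valid TGT (if the cache, rather than a keytab, is the credential source).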
Any ideas on why this could be happening?
I have a log file formatted as follows, and I want to split it into multiple files by day (i.e. log-2017-10-2, log-2017-10-3, etc.). I've seen people do it with awk, but I'm not sure how to handle stack traces, because lines like java.io.IOException and the at ... frames start on new lines without a date. Is there any convenient way to achieve this?
2017-10-02 04:26:02,534 INFO XXXXXXXXXXXXXXXXX
2017-10-03 04:26:02,543 INFO XXXXXXXXXXXX
2017-10-04 04:26:02,544 INFO XXXXXXXXX
2017-10-04 04:26:02,546 INFO XXXXXXXXXXXXX
2017-10-04 04:26:02,549 INFO XXXXXXXXXXX
2017-10-04 04:53:02,787 WARN class.class.class: [FetcherXXXXXX], Error in fetch XXXXXXXXXXXXXXXXXXXXXX
java.io.IOException: Connection to X was disconnected before the response was read
at XXXXXXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXX
2017-10-05 04:26:02,549 INFO XXXXXXXXXXX
Final file contents will be:
log-2017-10-2:
2017-10-02 04:26:02,534 INFO XXXXXXXXXXXXXXXXX
log-2017-10-3:
2017-10-03 04:26:02,543 INFO XXXXXXXXXXXX
log-2017-10-4:
2017-10-04 04:26:02,544 INFO XXXXXXXXX
2017-10-04 04:26:02,546 INFO XXXXXXXXXXXXX
2017-10-04 04:26:02,549 INFO XXXXXXXXXXX
2017-10-04 04:53:02,787 WARN class.class.class: [FetcherXXXXXX], Error in fetch XXXXXXXXXXXXXXXXXXXXXX
java.io.IOException: Connection to X was disconnected before the response was read
at XXXXXXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXX
log-2017-10-5:
2017-10-05 04:26:02,549 INFO XXXXXXXXXXX
awk to the rescue!
$ awk --posix 'BEGIN{f="log-header"}
$1~/^[0-9]{4}-[0-9]{2}-[0-9]{2}$/{f="log-"$1} {print > f}' log
If there are too many dates (and correspondingly too many open files), you may need to close files at some point; for a few hundred it should work as is.
The initial file name (log-header) is set in case your log doesn't start with a line matching the date regex.
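If you do need to close files, a small variant that closes the current file whenever the date changes should do (a sketch, assuming the log is sorted by date; note the >> append so a reopened file isn't truncated, and remove stale log-* files before re-running):
$ awk --posix 'BEGIN{f="log-header"}
  $1~/^[0-9]{4}-[0-9]{2}-[0-9]{2}$/ && "log-"$1!=f {close(f); f="log-"$1}
  {print >> f}' log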
awk solution:
awk '/^[0-9]{4}-[0-9]{2}-[0-9]{2} /{
         if (fn && !a[$1]++) close(fn)
         fn="log-"$1
     }
     { print > fn }' logfile
/^[0-9]{4}-[0-9]{2}-[0-9]{2} / - on encountering a line starting with a date string
if (fn && !a[$1]++) close(fn) - close the file descriptor opened for the previous "date"
fn="log-"$1 - construct the current filename (if the log might not start with a dated line, seed fn in a BEGIN block as in the answer above, since print > fn fails on an empty filename)
Viewing results:
$ head log-*
==> log-2017-10-02 <==
2017-10-02 04:26:02,534 INFO XXXXXXXXXXXXXXXXX
==> log-2017-10-03 <==
2017-10-03 04:26:02,543 INFO XXXXXXXXXXXX
==> log-2017-10-04 <==
2017-10-04 04:26:02,544 INFO XXXXXXXXX
2017-10-04 04:26:02,546 INFO XXXXXXXXXXXXX
2017-10-04 04:26:02,549 INFO XXXXXXXXXXX
2017-10-04 04:53:02,787 WARN class.class.class: [FetcherXXXXXX], Error in fetch XXXXXXXXXXXXXXXXXXXXXX
java.io.IOException: Connection to X was disconnected before the response was read
at XXXXXXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXX
at XXXXXXXXXXXXXXXX
==> log-2017-10-05 <==
2017-10-05 04:26:02,549 INFO XXXXXXXXXXX
I want to fetch log entries between two timestamps, but I do not have the exact timestamps with me. If the exact timestamp were present in the log, I could fetch with sed using the following command:
sed -rn "/$StartTime/,/$EndTime/p" <filename>
My problem is that the StartTime and EndTime I fetch from my DB might not be present in the log file, so I have to fetch the log between the times nearest to the StartTime and EndTime I provide, comparing with >= and <=. I tried the following command, but it does not work.
awk '$0>=st && $0<=et' st=$StartTime et=$EndTime <filename>
Sample input and output
Input
Time retrieved from DB
StartTime - 2017-11-02 10:20:00
EndTime - 2017-11-02 11:20:00
The time present in log
T1 - 2017-11-02 10:17:44
T2 - 2017-11-02 11:19:32
Output: Entire Log text between T1 & T2
Sample Log
2017-03-03 10:43:18,736 [main] WARN - ORACLE_HOSTNAME=xxxxxxxxxx[OVERRIDES:
xxxxxxxxxxxxxxxx]
2017-03-03 10:43:18,736 [main] WARN - NLS_DATE_FORMAT=DD-MON-YYYY
HH24:MI:SS [OVERRIDES: DD-MON-YYYY HH24:MI:SS]
2017-03-03 10:43:18,736 [main] WARN - xxxxUsername=MDMPIUSER [OVERRIDES: MDMPIUSER]
2017-03-03 10:43:18,736 [main] WARN - BUNDLE_GEMFILE=uri:classloader://installer/Gemfile [OVERRIDES: uri:classloader://installer/Gemfile]
2017-03-03 10:43:18,736 [main] WARN - TIMEOUT=900 [OVERRIDES: 900]
2017-03-03 10:43:18,736 [main] WARN - SHLVL=4 [OVERRIDES: 4]
2017-03-03 10:43:18,736 [main] WARN - HISTSIZE=1000 [OVERRIDES: 1000]
2017-03-03 10:43:18,736 [main] WARN - JAVA_HOME=/usr/java/jdk1.8.0_60/jre [OVERRIDES: /usr/java/jdk1.8.0_60/jre]
2017-03-03 10:43:20,156 [main] WARN - APP_PROPS=/home/xxx/conf/appProperties [OVERRIDES: /home/xxx/conf/appProperties]
You can try
awk -v start="$StartTime" -v end="$EndTime" '
# strip "-", ",", ":" and spaces so timestamps compare as plain digit strings
function fonct(date)
{
    gsub(/-|,| |:/,"",date)
    return date
}
BEGIN{
    start=fonct(start)
    end=fonct(end)
}
{
    a=fonct($1$2)              # normalize this line's date and time fields
    if (a>=start && a<=end) print $0
}' infile
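One caveat: continuation lines (like the wrapped OVERRIDES lines in the sample) do not start with a timestamp, so fonct($1$2) produces garbage for them and they are dropped. Here is a sketch that latches the in-range decision until the next timestamped line (the inrange flag and norm name are my own; older gawk may need --re-interval or --posix for the {n} intervals):
awk -v start="$StartTime" -v end="$EndTime" '
function norm(t) { gsub(/[-,: ]/,"",t); return t }   # same normalization idea
BEGIN { start=norm(start); end=norm(end) }
$1 ~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/ {                # timestamped line: decide
    t = norm($1 $2)
    inrange = (t >= start && t <= end)
}
inrange                                              # print while inside the range
' infile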
This is in reference to Need some help on shell script (using grep command).
I have a log file, and I am trying to write a shell script that reports whether pause/play requests succeeded or failed.
The script should do the following:
if ("trickModeRequest":{"rate":0.0})
{
if ("status":"OK")
printf("Pause Successful");
else if ("status":"GENERAL_ERROR")
printf("Pause failed");
else if ("status":"UNRECOGNIZED REQUEST")
printf("Pause failed");
}
Similarly:
if ("trickModeRequest":{"rate":1.0})
{
if ("status":"OK")
printf("Play Successful");
else if ("status":"GENERAL_ERROR")
printf("Play failed");
else if ("status":"UNRECOGNIZED REQUEST")
printf("Play failed");
}
My script file:
logfile=$1
awk '/trickModeRequest/ && /"status":"GENERAL_ERROR"/{ print "trickModeRequestFAILED" } /trickModeRequest/ && /"status":"OK"/ {print "trickModeRequestSUCCESS"} ' "$logfile"
The problem is that {"trickModeRequest":{"rate":0.0}} and {"trickModeRequest":{"rate":1.0}} do not appear on the same line as the status in the logs, so I don't know how to distinguish between pause and play...
My log file (logs.txt):
160125-11:11:16.654574 [mod=CARDLESSCA, lvl=DEBUG] [tid=2343] [RECORD]NAGRA_API:vlhal_CasPvrRecordingStart-post handle=0x007288a8
160125-11:11:16.654617 [mod=CARDLESSCA, lvl=INFO] [tid=2343] Recording nagra request for Recording vlhal_CasPvrPlaybackStart 0x7288a8
160125-11:11:16.655113 [mod=DVR, lvl=ERROR] [tid=2343] VLTune::process_buffer:1088(this=0x58ca00) - is_recording=1(was 0), hal_recording=0x7555ddd8
160125-11:11:16.905656 [mod=SYS, lvl=TRACE] [tid=2343] vl_env_get_bool: key=NAGRA.PRM.SUPPORT.ENABLED, result=1
160125-11:11:16.910125 [mod=SYS, lvl=TRACE] [tid=2343] vl_env_get_bool: key=NAGRA.PRM.SUPPORT.ENABLED, result=1
160125-11:11:16.911879 [mod=DVR, lvl=INFO] [tid=2343] HAL_RECORD_GetData:2989(0x7555ddd8) - /data/data/OCAP_MSV/0/0/DEFAULT_RECORDING_VOLUME/dvr/1453720276418.BOTF_Marker(worte=376)
160125-11:11:16.949874 [mod=SYS, lvl=INFO] [tid=2332] Read message '{"event":1,"handler":1,"name":"onRPCCall","params":{"callGUID":"228dd4a5-6727-4525-abce-331c87d54c18","callParams":[**{"trickModeRequest":{"rate":0.0}}**],"class":"com.comcast.xre.events.XRERPCCallInfo","destinationSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d",**"method":"trickModeRequest"**,"sourceSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d"},"phase":"STANDARD","source":1}
'
160125-11:11:16.950180 [mod=SYS, lvl=INFO] [tid=2332] ======= Message is onRPCCall ======>
160125-11:11:16.950326 [mod=SYS, lvl=INFO] [tid=2332] Entering onRPCCallEvent for request ---> trickModeRequest
160125-11:11:16.950621 [mod=SYS, lvl=INFO] [tid=2332] Received json request = {"event":1,"handler":1,"name":"onRPCCall","params":{"callGUID":"228dd4a5-6727-4525-abce-331c87d54c18","callParams":[**{"trickModeRequest":{"rate":0.0}**}],"class":"com.comcast.xre.events.XRERPCCallInfo","destinationSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d","method":"trickModeRequest","sourceSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d"},"phase":"STANDARD","source":1}
160125-11:11:16.950689 [mod=SYS, lvl=INFO] [tid=2332] trickModeRequest: trickModeRequest
160125-11:11:16.950794 [mod=TUNERSOURCE, lvl=INFO] [tid=2332] Setting rate 0.0, skipping 0 ms
160125-11:11:16.950872 [mod=TUNERSOURCE, lvl=INFO] [tid=2332] trickModeRequest() - Sending the TrickMode event
160125-11:11:16.950959 [mod=SYS, lvl=INFO] [tid=2332] PlaybackResponse: entered 0
160125-11:11:16.950994 [mod=MEDIA_SESSION_MGR, lvl=DEBUG] [tid=2331] MediaSessionManager::trickMode() - Handling trick mode
160125-11:11:16.951059 [mod=MEDIA_SESSION_MGR, lvl=DEBUG] [tid=2331] MediaSessionManager::trickMode() - rate = 0.00000, skip = 0 ms
160125-11:11:16.951123 [mod=PLAYER, lvl=TRACE] [tid=2331] PlayerBase(id=56a602d3)::stop - Before event: state = %s
160125-11:11:16.951182 [mod=PLAYER, lvl=DEBUG] [tid=2331] DecodePlayer(id=56a602d3)::DecodePlayerStatePresenting::stop - entering state %s
160125-11:11:16.951295 [mod=PLAYER, lvl=DEBUG] [tid=2331] PlayerBase(id=56a602d3)::PlayerStateBase::waitForExistingWorkerThread - waiting for the previous worker thread to stop
160125-11:11:16.951361 [mod=PLAYER, lvl=DEBUG] [tid=2331] PlayerBase(id=56a602d3)::PlayerStateBase::waitForExistingWorkerThread - done joining
160125-11:11:16.951535 [mod=PLAYER, lvl=TRACE] [tid=2331] PlayerBase(id=56a602d3)::stop - After event: state = %s
160125-11:11:16.951639 [mod=SYS, lvl=INFO] [tid=2332] ====== Response sending is {"appId":1,"command":"CALL","commandIndex":5,"method":"generateAppEvent","params":[{"class":"com.comcast.xre.events.XREOutgoingEvent","name":"onRPCReturn","params":{"callGUID":"228dd4a5-6727-4525-abce-331c87d54c18","class":"com.comcast.xre.events.XRERPCReturnInfo","destinationSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d",**"method":"trickModeRequest"**,"returnVal":{"class":"com.comcast.parker.SelectServiceSyncResponse","selectServiceStatus":"AWAITING_ASYNC","startTime":1296590759,"**status":"OK"**,"statusMessage":"Tuning to service: "},"sourceSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d"}},"ab00ebea-5f63-4619-9877-273a2bceea1d"],"targetId":1,"targetPath":"","timestamp":0}
====
160125-11:11:18.002455 [mod=SYS, lvl=INFO] [tid=2332] Read message '{"event":1,"handler":1,"name":"onRPCCall","params":{"callGUID":"fe11a665-2aad-404c-a887-bd3d967bbea0","callParams":[{**"trickModeRequest":{"rate":1.0}**}],"class":"com.comcast.xre.events.XRERPCCallInfo","destinationSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d",**"method":"trickModeRequest"**,"sourceSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d"},"phase":"STANDARD","source":1}
'
160125-11:11:18.002756 [mod=SYS, lvl=INFO] [tid=2332] ======= Message is onRPCCall ======>
160125-11:11:18.002843 [mod=SYS, lvl=INFO] [tid=2332] Entering onRPCCallEvent for request ---> trickModeRequest
160125-11:11:18.003106 [mod=SYS, lvl=INFO] [tid=2332] Received json request = {"event":1,"handler":1,"name":"onRPCCall","params":{"callGUID":"fe11a665-2aad-404c-a887-bd3d967bbea0","callParams":[{"trickModeRequest":{"rate":1.0}}],"class":"com.comcast.xre.events.XRERPCCallInfo","destinationSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d",**"method":"trickModeRequest"**,"sourceSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d"},"phase":"STANDARD","source":1}
160125-11:11:18.003168 [mod=SYS, lvl=INFO] [tid=2332] trickModeRequest: trickModeRequest
160125-11:11:18.003264 [mod=TUNERSOURCE, lvl=INFO] [tid=2332] Setting rate 1.0, skipping 0 ms
160125-11:11:18.003315 [mod=TUNERSOURCE, lvl=INFO] [tid=2332] trickModeRequest() - Sending the TrickMode event
160125-11:11:18.003387 [mod=SYS, lvl=INFO] [tid=2332] PlaybackResponse: entered 0
160125-11:11:18.003952 [mod=SYS, lvl=INFO] [tid=2332] ====== Response sending is {"appId":1,"command":"CALL","commandIndex":5,"method":"generateAppEvent","params":[{"class":"com.comcast.xre.events.XREOutgoingEvent","name":"onRPCReturn","params":{"callGUID":"fe11a665-2aad-404c-a887-bd3d967bbea0","class":"com.comcast.xre.events.XRERPCReturnInfo","destinationSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d",**"method":"trickModeRequest"**,"returnVal":{"class":"com.comcast.parker.SelectServiceSyncResponse","selectServiceStatus":"AWAITING_ASYNC","startTime":1296590759,**"status":"OK"**,"statusMessage":"Tuning to service: "},"sourceSessionGUID":"ab00ebea-5f63-4619-9877-273a2bceea1d"}},"ab00ebea-5f63-4619-9877-273a2bceea1d"],"targetId":1,"targetPath":"","timestamp":0}
====
160125-11:11:18.004268 [mod=MEDIA_SESSION_MGR, lvl=DEBUG] [tid=2331] MediaSessionManager::trickMode() - Handling trick mode
160125-11:11:18.004370 [mod=MEDIA_SESSION_MGR, lvl=DEBUG] [tid=2331] MediaSessionManager::trickMode() - rate = 1.00000, skip = 0 ms
160125-11:11:18.004429 [mod=PLAYER, lvl=DEBUG] [tid=2331] PlayerBase(id=56a602d3)::PlayerStateBase::waitForExistingWorkerThread - waiting for the previous worker thread to stop
160125-11:11:18.010346 [mod=SYS, lvl=INFO] [tid=2332] ====== Response of 685 bytes sent ======
160125-11:11:18.010529 [mod=SYS, lvl=INFO] [tid=2332] ProcessJsonData: returned success
160125-11:11:18.010633 [mod=SYS, lvl=INFO] [tid=2332] === OnConnect returned PS_SUCCESS
160125-11:11:18.011512 [mod=DVR, lvl=INFO] [tid=2349] HAL_PLAYBACK_Start:3789(0x596290) - NEXUS_Playpump_Open(NEXUS_ANY_ID, 0xc7e66800) returns 0x54052014)
160125-11:11:18.011555 [mod=SYS, lvl=TRACE] [tid=2343] vl_env_get_bool: key=NAGRA.PRM.SUPPORT.ENABLED, result=1
160125-11:11:18.013621 [mod=SYS, lvl=TRACE] [tid=2349] vl_env_get_bool: key=FEATURE.TRICKPLAY.CHANNEL_CHANGE.SLOW_MOTION, result=0
160125-11:11:18.013727 [mod=SYS, lvl=TRACE] [tid=2349] vl_env_get_bool: key=FEATURE.PARENTALCONTROL.POLICY.STRICT, result=1
160125-11:11:18.013786 [mod=DVR, lvl=INFO] [tid=2349] HAL_Playback_Init:355() - Using FEATURE.PARENTALCONTROL.POLICY.STRICT = true
Following your logic, you could use a variable type to keep track of the calls. Indeed, in your logs there is always a line listing the Received json ... callParams before the response carrying the status. If you always have this sequence (hard to tell from your example), you could do:
awk 'BEGIN { type=""; } /Received json/ && /"rate":1.0/ {type="play"} /Received json/ && /"rate":0.0/ {type="pause"} <your code>' "$logfile"
You can then use type in your printed message, which should be play or pause.
Complete code:
awk 'BEGIN{ type="" }
     /Received json/ && /"rate":1.0/ {type="play"}
     /Received json/ && /"rate":0.0/ {type="pause"}
     /trickModeRequest/ && /"status":"GENERAL_ERROR"/ { print type, "FAILED" }
     /trickModeRequest/ && /"status":"OK"/ {print type, "SUCCESSFUL"}' "$logfile"
Output:
pause SUCCESSFUL
play SUCCESSFUL
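If you also want to see when each verdict happened, you can prefix the log timestamp; a sketch along the same lines ($1 is the timestamp field in these logs):
awk '/Received json/ && /"rate":1.0/ {type="play"}
     /Received json/ && /"rate":0.0/ {type="pause"}
     /trickModeRequest/ && /"status":"OK"/ {print $1, type, "SUCCESSFUL"}
     /trickModeRequest/ && /"status":"GENERAL_ERROR"/ {print $1, type, "FAILED"}' "$logfile"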
We tried to submit a simple SparkPi example to Spark on YARN. The .bat is written as below:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 1g --executor-cores 1 .\examples\target\spark-examples_2.10-1.4.0.jar 10
pause
Our HDFS and YARN work well. We are using Hadoop 2.7.0 and Spark 1.4.1. We have only one node, which acts as both NameNode and DataNode.
When we execute it, it fails, and the log says the following:
2015-08-21 11:07:22,044 DEBUG [main] | ===============================================================================
2015-08-21 11:07:22,044 DEBUG [main] | Yarn AM launch context:
2015-08-21 11:07:22,044 DEBUG [main] | user class: org.apache.spark.examples.SparkPi
2015-08-21 11:07:22,044 DEBUG [main] | env:
2015-08-21 11:07:22,044 DEBUG [main] | CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__hadoop_conf__<CPS>{{PWD}}/__spark__.jar<CPS>%HADOOP_HOME%\etc\hadoop<CPS>%HADOOP_HOME%\share\hadoop\common\*<CPS>%HADOOP_HOME%\share\hadoop\common\lib\*<CPS>%HADOOP_HOME%\share\hadoop\mapreduce\*<CPS>%HADOOP_HOME%\share\hadoop\mapreduce\lib\*<CPS>%HADOOP_HOME%\share\hadoop\hdfs\*<CPS>%HADOOP_HOME%\share\hadoop\hdfs\lib\*<CPS>%HADOOP_HOME%\share\hadoop\yarn\*<CPS>%HADOOP_HOME%\share\hadoop\yarn\lib\*<CPS>%HADOOP_MAPRED_HOME%\share\hadoop\mapreduce\*<CPS>%HADOOP_MAPRED_HOME%\share\hadoop\mapreduce\lib\*
2015-08-21 11:07:22,060 DEBUG [main] | SPARK_YARN_CACHE_FILES_FILE_SIZES -> 165181064,1420218
2015-08-21 11:07:22,060 DEBUG [main] | SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1440062075415_0026
2015-08-21 11:07:22,060 DEBUG [main] | SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE,PRIVATE
2015-08-21 11:07:22,060 DEBUG [main] | SPARK_USER -> msrabi
2015-08-21 11:07:22,060 DEBUG [main] | SPARK_YARN_MODE -> true
2015-08-21 11:07:22,060 DEBUG [main] | SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1440126441200,1440126441575
2015-08-21 11:07:22,060 DEBUG [main] | SPARK_YARN_CACHE_FILES -> hdfs://msra-sa-44:9000/user/msrabi/.sparkStaging/application_1440062075415_0026/spark-assembly-1.4.0-hadoop2.7.0.jar#__spark__.jar,hdfs://msra-sa-44:9000/user/msrabi/.sparkStaging/application_1440062075415_0026/spark-examples_2.10-1.4.0.jar#__app__.jar
2015-08-21 11:07:22,060 DEBUG [main] | resources:
2015-08-21 11:07:22,060 DEBUG [main] | __app__.jar -> resource { scheme: "hdfs" host: "msra-sa-44" port: 9000 file: "/user/msrabi/.sparkStaging/application_1440062075415_0026/spark-examples_2.10-1.4.0.jar" } size: 1420218 timestamp: 1440126441575 type: FILE visibility: PRIVATE
2015-08-21 11:07:22,060 DEBUG [main] | __spark__.jar -> resource { scheme: "hdfs" host: "msra-sa-44" port: 9000 file: "/user/msrabi/.sparkStaging/application_1440062075415_0026/spark-assembly-1.4.0-hadoop2.7.0.jar" } size: 165181064 timestamp: 1440126441200 type: FILE visibility: PRIVATE
2015-08-21 11:07:22,060 DEBUG [main] | __hadoop_conf__ -> resource { scheme: "hdfs" host: "msra-sa-44" port: 9000 file: "/user/msrabi/.sparkStaging/application_1440062075415_0026/__hadoop_conf__7908628615251032149.zip" } size: 82888 timestamp: 1440126441794 type: ARCHIVE visibility: PRIVATE
2015-08-21 11:07:22,060 DEBUG [main] | command:
2015-08-21 11:07:22,075 DEBUG [main] | {{JAVA_HOME}}/bin/java -server -Xmx4096m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.app.name=org.apache.spark.examples.SparkPi' '-Dspark.executor.memory=1g' '-Dspark.driver.memory=4g' '-Dspark.master=yarn-cluster' -Dspark.yarn.app.container.log.dir=<LOG_DIR> org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar file:/D:/sp/./examples/target/spark-examples_2.10-1.4.0.jar --arg '10' --executor-memory 1024m --executor-cores 1 --num-executors 3 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
2015-08-21 11:07:22,075 DEBUG [main] | ===============================================================================
...........(omitting some lines)......
2015-08-21 11:07:23,231 INFO [main] | Application report for application_1440062075415_0026 (state: ACCEPTED)
2015-08-21 11:07:23,247 DEBUG [main] |
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1440126442169
final status: UNDEFINED
tracking URL: http://msra-sa-44:8088/proxy/application_1440062075415_0026/
user: msrabi
2015-08-21 11:07:24,263 TRACE [main] | 1: Call -> MSRA-SA-44/10.190.173.181:8032: getApplicationReport {application_id { id: 26 cluster_timestamp: 1440062075415 }}
2015-08-21 11:07:24,263 DEBUG [IPC Parameter Sending Thread #0] | IPC Client (443384617) connection to MSRA-SA-44/10.190.173.181:8032 from msrabi sending #37
2015-08-21 11:07:24,263 DEBUG [IPC Client (443384617) connection to MSRA-SA-44/10.190.173.181:8032 from msrabi] | IPC Client (443384617) connection to MSRA-SA-44/10.190.173.181:8032 from msrabi got value #37
2015-08-21 11:07:24,263 DEBUG [main] | Call: getApplicationReport took 0ms
2015-08-21 11:07:24,263 TRACE [main] | 1: Response <- MSRA-SA-44/10.190.173.181:8032: getApplicationReport {application_report { applicationId { id: 26 cluster_timestamp: 1440062075415 } user: "msrabi" queue: "default" name: "org.apache.spark.examples.SparkPi" host: "N/A" rpc_port: -1 yarn_application_state: ACCEPTED trackingUrl: "http://msra-sa-44:8088/proxy/application_1440062075415_0026/" diagnostics: "" startTime: 1440126442169 finishTime: 0 final_application_status: APP_UNDEFINED app_resource_Usage { num_used_containers: 1 num_reserved_containers: 0 used_resources { memory: 4608 virtual_cores: 1 } reserved_resources { memory: 0 virtual_cores: 0 } needed_resources { memory: 4608 virtual_cores: 1 } memory_seconds: 0 vcore_seconds: 0 } originalTrackingUrl: "N/A" currentApplicationAttemptId { application_id { id: 26 cluster_timestamp: 1440062075415 } attemptId: 1 } progress: 0.0 applicationType: "SPARK" }}
2015-08-21 11:07:24,263 INFO [main] | Application report for application_1440062075415_0026 (state: ACCEPTED)
.......(omitting some lines where the state are all ACCEPTED and final status are all UNDEFINED).....
2015-08-21 11:07:30,359 INFO [main] | Application report for application_1440062075415_0026 (state: FAILED)
2015-08-21 11:07:30,359 DEBUG [main] |
client token: N/A
diagnostics: Application application_1440062075415_0026 failed 2 times due to AM Container for appattempt_1440062075415_0026_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://msra-sa-44:8088/cluster/app/application_1440062075415_0026Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1440062075415_0026_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Shell output: 1 file(s) moved.
We then opened stderr; it says:
Error: Could not find or load main class 'Dspark.app.name=org.apache.spark.examples.SparkPi'
It's strange: this should be a parameter passed to java, yet it seems java recognized it as the main class. There should be a main class parameter in the command section of the log, but there is none.
How can that happen? What should we do to find out what's wrong?
Thank you!
We solved this problem.
The root cause is that, when generating the java command line, our Spark wraps the parameters in single quotes ('-Dxxxx'). Single quotes work only on Linux. On Windows, the parameters must either be left unwrapped or wrapped in double quotes ("-Dxxxx"). The only way to solve this was to edit the Spark source code and re-compile it.
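To see the difference from a Windows command prompt, here is a minimal repro with java directly (hypothetical, not from our cluster):
REM cmd.exe does not treat single quotes as quoting characters, so java
REM receives the literal token '-Dspark.app.name=SparkPi' and, since it
REM does not start with "-", tries to load it as the main class:
java '-Dspark.app.name=SparkPi' -version
REM double quotes are stripped during argument parsing, so this one is
REM recognized as a -D system property:
java "-Dspark.app.name=SparkPi" -version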
It seems this is currently a known issue in Spark (https://issues.apache.org/jira/browse/SPARK-5754).