I'm wondering if there is any way to hide the Sqoop process output in a Unix shell.
For example, instead of that output, print some text like "sqoop processing".
Thanks
The way I deal with this for pig scripts (which also tend to give a lot of output, and run for a long time) is as follows:
Rather than running
pig mypath/myscript.pig
I will run
nohup pig mypath/myscript.pig &
In your case that would mean something like
nohup oozie -job something &
This has the added benefit that your job will not be stopped if your SSH connection times out. If you are not using SSH at the moment, that may be an additional step you need.
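If the goal is simply to replace the noise with a short status message, a minimal sketch is to redirect everything to a log file; the Sqoop arguments below are placeholders:
echo "sqoop processing"
# Send both stdout and stderr to a log file instead of the terminal;
# nohup keeps the job alive if the SSH session drops.
nohup sqoop import --connect jdbc:mysql://dbhost/mydb --table mytable > sqoop.log 2>&1 &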
I have a bunch of bash scripts that I run sequentially. I'm going to consolidate to a single script but there's one part that's a bit tricky. Specifically, script C launches a Google Compute Engine job and I only want script D (the one immediately following it) to execute once that's done.
Is there a good way of doing this?
In case it helps, my new script would be:
source script_A.sh
source script_B.sh
source script_C.sh
**wait until cloud job has finished**
source script_D.sh
Thanks!
After gcloud ... & is called, use gcloudpid=$! (I don't think you have to export it, but it wouldn't hurt) to grab its PID. Then your main script will be:
source script_C.sh
wait $gcloudpid
source script_D.sh
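A minimal sketch of the whole pattern, assuming script_C.sh is where the job is launched (the gcloud arguments stay a placeholder):
# --- script_C.sh ---
# Launch the cloud job in the background and remember its PID so the caller can wait on it.
gcloud ... &
gcloudpid=$!

# --- main script ---
source script_A.sh
source script_B.sh
source script_C.sh   # sets gcloudpid
wait "$gcloudpid"    # block until that background process exits
source script_D.sh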
Given a project that consists of a large number of bash scripts launched periodically from crontab, how can one track the execution time of each script?
There is a straightforward approach: edit each of those files to add date calls.
But what I really want is some kind of daemon that could track execution times and submit the results somewhere several times a day.
So the question is:
Is it possible to gather information about the execution time of 200 bash scripts without editing each of them?
The time command is considered a fallback solution, if nothing better can be found.
Depending on your system's cron implementation, you may be able to define the log level of the cron daemon. For Ubuntu's default vixie-cron, setting the log level accordingly will log the start and end of each job execution, which can then be analyzed.
On current Ubuntu LTS this works by defining the log level in /etc/init/cron,
appending the -L 3 option to the exec line so that it looks like:
exec cron -L 3
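Once that is in place, the start/end entries for each job should show up in the system log. The exact wording depends on the cron build, but on a default Ubuntu setup something like this surfaces them:
grep CRON /var/log/syslog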
You could change your cron entries to run your scripts under time:
time scriptname
And pipe the output to your logs.
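For example, a hypothetical crontab entry using GNU time (-o sets the output file, -a appends) so all timings land in one log; the paths and schedule are placeholders:
*/15 * * * * /usr/bin/time -a -o /var/log/script_times.log /path/to/myscript.sh >> /var/log/myscript.log 2>&1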
I am on Hadoop 2.2.0, running a Single Node setup.
My understanding is that hdfs dfs -ls is slow because it is spinning up a JVM every time it is invoked.
Is there any way to make it keep the JVM running so simple commands can complete faster?
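As a rough check of the fixed startup cost, timing a trivial listing shows how much of the delay is per-invocation overhead:
time hdfs dfs -ls /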
I would like to share a solution we built to address this problem.
We created a new utility, HDFS Shell, to work with HDFS much faster.
https://github.com/avast/hdfs-shell
hdfs dfs initiates a JVM for each command call, while HDFS Shell does it only once, which means a great speed improvement when you need to work with HDFS frequently.
Commands can be used in a short form, e.g. hdfs dfs -ls / and ls / will both work.
HDFS path completion using the TAB key.
We can easily add any other HDFS manipulation function.
Command history is persisted in a history log (~/.hdfs-shell/hdfs-shell.log).
Support for relative directories and the cd and pwd commands.
And much more...
In the Pig grunt shell, commands like fs -ls work quite fast, so that might be a pragmatic workaround. The problem is that this doesn't work well when trying to pipe the output to other commands.
So I hacked together a script to start the Pig grunt shell as a background process and communicate with it via named pipes: https://unix.stackexchange.com/a/144722/46085. The problem is that even though I use the script tool to fake a real terminal (because the grunt shell expects one), the grunt shell still kills itself sometimes. I also get problems when truncating the output with head or similar, because it still tries to write the whole output, which in turn can leave stale output in the named pipe.
Anyway, you might have a look and see if it works for you. I'd appreciate any improvements you may find.
Check out Hadoop Tools. It provides a similar interface to hdfs dfs but is much faster. It also supports tab completion of filenames on HDFS with bash completion, which is a huge time saver.
It doesn't support put yet though.
I have a couple of questions about parameter substitution in Pig.
I am on Pig 0.10.
Can I access Unix environment variables in the grunt shell? In Hive we can do this via ${env:variable}.
I have a bunch of Pig scripts that are automated and run in batch mode. I use a number of parameters inside them and substitute them from the command line (either -param or -param_file). When I need to enhance (or debug) a Pig script in grunt mode, I am left manually replacing the parameters with their values. Is there a better way of handling this situation?
Thanks for the help!
For the first question, Pig does not support reading environment variables directly. Is there any special requirement? You should be able to pass the environment values to Pig as command-line parameters.
For the second question, Pig currently does not support parameters in Grunt. You can check the issue and discussion in PIG-2122. Aniket Mokashi suggests the following approach:
Store your script lines in a file (with the $params included).
Start grunt interactively.
Type run -param a=b -param c=d myscript.pig
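Putting the two answers together, a small hypothetical example of passing a Unix environment variable into a script via parameter substitution (the variable and script names are placeholders; the script refers to $input_dir internally):
# From the shell: forward an environment variable as a Pig parameter.
export INPUT_DIR=/data/raw
pig -param input_dir="$INPUT_DIR" myscript.pig
# From Grunt, as suggested above:
run -param input_dir=/data/raw myscript.pig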
I have installed Cygwin, Hadoop, and Pig on Windows. The configuration seems OK, as I can run Pig scripts in batch and embedded mode.
When I try to run Pig in grunt mode, something strange happens. Let me explain.
I try to run a simple command like
grunt> A = load 'passwd' using PigStorage(':');
When I press Enter, nothing happens. The cursor goes to the next line and the grunt> prompt never appears again. It feels as if I am typing in a text editor.
Has anything similar ever happened to you? Do you have any idea how I can solve this?
What you are observing is consistent with Pig's normal behavior. I will take the Pig tutorial as an example.
The following command does not result in any activity by Pig:
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
But if you invoke a command that uses the data from the variable raw and triggers a map-reduce job, that is when you will see some action in your grunt shell. Something along the lines of the second command mentioned there:
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
Similarly, your command will not result in any action; you have to use the data from variable A in a way that triggers a map-reduce job to see some action in the grunt shell:
grunt> A = load 'passwd' using PigStorage(':');
Pig only processes the commands when you use a command that creates output, namely DUMP (to the console) or STORE. You can also use DESCRIBE to get the structure of an alias and EXPLAIN to see the map/reduce plan.
So basically DUMP A; will give you all the records in A.
Please try running Pig from the Windows command window:
C:\FAST\JDK64\1.6.0.31/bin/java -Xmx1000m -Dpig.log.dir=C:/cygwin/home/$USERNAME$/nubes/pig/logs -Dpig.log.file=pig.log -Dpig.home.dir=C:/cygwin/home/$USERNAME$/nubes/pig/ -classpath C:/cygwin/home/$USERNAME$/nubes/pig/conf;C;C:/FAST/JDK64/1.6.0.31/lib/tools.jar;C:/cygwin/home/$USERNAME$/nubes/pig/lib/jython-standalone-2.5.3.jar;C:/cygwin/home/$USERNAME$/nubes/pig/conf;C:/cygwin/home/$USERNAME$/nubes/hadoop/conf;C:/cygwin/home/$USERNAME$/nubes/pig/pig-0.11.1.jar org.apache.pig.Main -x local
Replace $USERNAME$ with your user ID accordingly.
Modify the classpath and the conf path accordingly.
It works well in both local and map-reduce mode.
The Pig shell hangs in Cygwin, but a Pig script executes successfully from a script file.
As below:
$ pig ./user/input.txt
For local mode:
$ pig -x local ./user/input.txt
I came across the same problem yesterday, and I spent a whole day finding out what was wrong with my Pig setup (or my hotkeys) before finally fixing it. It turned out that I had copied the Pig code from another source, and the curly ("smart") quotation marks it contained are not recognized by the Pig command line, which only accepts straight quotation marks, so the input stream never ended.
My suggestion is to watch out for invalid characters in the code, especially when you copy code into the command line, as that often causes unexpected faults.
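One hypothetical way to catch such characters before pasting, assuming GNU grep with PCRE support, is to list any non-ASCII bytes in the script (the file name is a placeholder):
# Print line numbers of lines containing non-ASCII characters (e.g. curly quotes).
grep -nP '[^\x00-\x7F]' myscript.pig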