I am trying to save JSON produced by a bash command to a GCS bucket. When I execute the bash command in my local terminal, everything works and the data is loaded into GCS. Unfortunately, the same bash command doesn't work via Airflow: Airflow marks the task as successful, but in GCS I see an empty file. I suspect this happens because Airflow runs out of memory, but I am not sure. If so, can someone explain how and where the results are stored in Airflow? I see in the BashOperator documentation that Airflow creates a temporary directory which is cleaned up after execution. Does that mean the results of the bash command are also cleaned up afterwards? Is there any way to save the results to GCS?
This is my DAG:
get_data = BashOperator(
    task_id='get_data',
    bash_command="curl -X GET -H 'XXX: xxx' some_url | gsutil cp -L manifest.txt - gs://bucket/folder1/filename.json; rm manifest.txt",
    dag=dag
)
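One thing worth checking: the exit status of this command is the status of the last command (rm), so if curl fails (for example with an auth error), gsutil still uploads an empty stream and Airflow still sees success. A minimal sketch of a stricter version, with some_url and the header standing in for the real values:
set -euo pipefail
curl --fail -X GET -H 'XXX: xxx' some_url \
  | gsutil cp -L manifest.txt - gs://bucket/folder1/filename.json
rm manifest.txt
With set -o pipefail and curl --fail, the task fails loudly instead of leaving an empty file in the bucket.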
I found the following error while executing via crontab:
"/home/sqoop/proddata/test.sh: 2: /home/sqoop/proddata/test.sh: hdfs: not found"
I ran the following command in the terminal, and I also put the same command in the script; both work, but via cron it does not:
# hdfs dfs -mkdir /user/hadoop/table/testtable
# vim test.sh
hdfs dfs -mkdir /user/hadoop/table/testtable
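Cron runs with a minimal environment, so hdfs is most likely not on its PATH. A sketch of one workaround, assuming Hadoop is installed under /usr/local/hadoop (adjust to your layout), is to export the PATH at the top of test.sh or call the binary by its absolute path:
#!/bin/sh
# cron provides only a minimal PATH, so point it at the Hadoop binaries explicitly
export PATH=$PATH:/usr/local/hadoop/bin
hdfs dfs -mkdir /user/hadoop/table/testtable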
I was trying to list the set of tables present in HBase using the script below:
#!/bin/bash
/home/user/hbase-1.2.4/bin/hbase shell << eof > /home/user/myfile.txt
list 'RAW_5_.*'
eof
I am able to get the table list when I run the script from the bash terminal using
sh script.sh
but it creates a 0 KB file when run from crontab. I have given the absolute path to hbase.
Can anyone help with this, please?
Since it executes properly from the terminal but not from crontab, try loading the user's bash profile in the script instead of #!/bin/bash, i.e.,
change
#!/bin/bash
to
. ~/.bash_profile
Cron does not load ~/.bash_profile on its own, and it carries the user-specific configuration (such as PATH entries) that your interactive shell has.
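Applied to the script from the question, that would look roughly like this (same hbase path and output file as above):
. ~/.bash_profile   # pull in the user's PATH and other settings that cron does not provide
/home/user/hbase-1.2.4/bin/hbase shell << eof > /home/user/myfile.txt
list 'RAW_5_.*'
eof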
Executing a shell script with gsutil is nowhere mentioned in the documentation. I tried some options but still no luck. I have a .sh file stored in a storage bucket; is there any way to execute this script using gsutil?
gsutil doesn't support executing shell scripts directly, but you could pipe the script to a shell, for example:
gsutil cat gs://your-bucket/your-script.sh | sh
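If the script takes arguments or you want to inspect it before running, a copy-then-run variant works as well (assuming /tmp is writable on the machine where gsutil runs):
gsutil cp gs://your-bucket/your-script.sh /tmp/your-script.sh
chmod +x /tmp/your-script.sh
/tmp/your-script.sh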
I have a CSV file sample.csv located at /home/hadoop/Desktop/script/sample.csv.
I tried to load it in Pig using
movies = load '/home/hadoop/Desktop/script/sample.csv' using PigStorage(',') as (id,name,year,rating,duration);
But this Pig statement is giving an error: when I run dump movies;, it throws an error and shows that the input and output failed.
Please suggest how to load the data using a Pig statement.
If your input file is on the local file system, you can enter the grunt shell by typing pig -x local
Once you are in the grunt shell, you can type the statement below
record = LOAD '/home/hadoop/Desktop/script/sample.csv' using PigStorage(',') as (id:int,name:chararray,year:chararray,rating:chararray,duration:int);
dump record;
If your input file is not on the local file system, then you first need to copy that file from local to HDFS using the command below
hadoop dfs -put <path of file at local> <path of hdfs dir>
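For this particular file that could look like the following (the HDFS directory /user/hadoop/inputfiles is an assumption; create it first if it doesn't exist):
hadoop dfs -mkdir /user/hadoop/inputfiles
hadoop dfs -put /home/hadoop/Desktop/script/sample.csv /user/hadoop/inputfiles/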
Once your file is loaded into HDFS, you can enter MapReduce mode by typing pig
Again the grunt shell will open. I am assuming that your HDFS location is something like the LOAD statement below
record = LOAD '/user/hadoop/inputfiles/sample.csv' using PigStorage(',') as (id:int,name:chararray,year:chararray,rating:chararray,duration:int);
dump record;
You can also use the copyFromLocal command in the grunt shell to move a local file to HDFS.
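For example, with the paths from this question (the HDFS target directory is an assumption):
grunt> copyFromLocal /home/hadoop/Desktop/script/sample.csv /user/hadoop/inputfiles/sample.csv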
Open the Pig shell in local mode with pig -x local, and if your file is present on HDFS, then you can use pig to open the grunt shell.
$pig -x local
grunt> movies = load '/home/hadoop/Desktop/script/sample.csv' using PigStorage(',') as (id:int,name:chararray,year:chararray,rating:chararray,duration:chararray);
grunt> dump movies;