Pig script scheduled by crontab not giving result - hadoop

I have a Pig script which gives the proper result when I run it from Pig (MapReduce mode), but when I schedule it from crontab it does not store the output as the script specifies.
The Pig script is:
a1 = load '/user/training/abhijit_hdfs/id' using PigStorage('\t') as (id:int,name:chararray,desig:chararray);
a2 = load '/user/training/abhijit_hdfs/trips' using PigStorage('\t') as (id:int,place:chararray,no_trips:int);
j = join a1 by id,a2 by id;
g = group j by(a1::id,a1::name,a1::desig);
su = foreach g generate group,SUM(j.a2::no_trips) as tripsum;
ord = order su by tripsum desc;
f2 = foreach ord generate $0.$0,$0.$1,$0.$2,$1;
store f2 into '/user/training/abhijit_hdfs/results/trip_output' using PigStorage(' ');
The crontab is:
[training@localhost ~]$ crontab -l
40 3 * * * /home/training/Abhijit_Local/trip_crontab.pig
Please guide.

Your crontab is attempting to treat the Pig script as an executable file and run it directly. Instead, you will likely need to pass it through the pig command explicitly, as described in the Apache Pig documentation on Batch Mode. You may also find it helpful to redirect stdout and stderr output to a log file somewhere in case you need to troubleshoot failures.
40 3 * * * pig /home/training/Abhijit_Local/trip_crontab.pig > /some/path/to/logfile 2>&1
Depending on PATH environment variable settings, you might find that it's necessary to specify the absolute path to the pig command.
40 3 * * * /full/path/pig /home/training/Abhijit_Local/trip_crontab.pig > /some/path/to/logfile 2>&1

Related

Efficient copy method in Hadoop

Is there a faster or more efficient way of copying files across HDFS other than distcp? I tried both the regular hadoop fs -cp and distcp, and both seem to give the same transfer rate, around 50 MBPS.
I have 5TB of data split into smaller files of 500GB each which I have to copy to a new location on HDFS. Any thoughts?
Edit:
The original distcp was only spawning 1 mapper, so I added the -m100 option to increase the number of mappers:
hadoop distcp -D mapred.job.name="Gigafiles distcp" -pb -i -m100 "/user/abc/file1" "/xyz/aaa/file1"
But it is still spawning only 1 mapper, not 100. Am I missing something here?
I came up with this if you want to copy a subset of files from one folder to another in HDFS. It may not be as efficient as distcp, but it does the job and gives you more freedom in case you want to do other operations. It also checks whether each file already exists at the destination:
import pandas as pd
from multiprocessing import Process
from subprocess import Popen, PIPE

hdfs_path_1 = '/path/to/the/origin/'
hdfs_path_2 = '/path/to/the/destination/'

# List what is already in the destination so files are not copied twice
process = Popen(f'hdfs dfs -ls -h {hdfs_path_2}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
# Skip the "Found N items" header line; the path is the last whitespace-separated field
already_processed = [fn.split()[-1].split('/')[-1] for fn in std_out.decode().splitlines()[1:]]
print(f'Total number of ALREADY PROCESSED tar files = {len(already_processed)}')

df = pd.read_csv("list_of_files.csv")  # or any other list that you have
to_do_tar_list = list(df.tar)
to_do_list = set(to_do_tar_list) - set(already_processed)
print(f'To go: {len(to_do_list)}')

def copyy(f):
    process = Popen(f'hdfs dfs -cp {hdfs_path_1}{f} {hdfs_path_2}', shell=True, stdout=PIPE, stderr=PIPE)
    std_out, std_err = process.communicate()
    if std_out != b'':
        print(std_out)

ps = []
for f in to_do_list:
    p = Process(target=copyy, args=(f,))
    p.start()
    ps.append(p)
for p in ps:
    p.join()
print('done')
Also, if you want to get a list of all files in a directory, use this:
from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
# Skip the "Found N items" header line; the full path is the last whitespace-separated field
list_of_file_names = [fn.split()[-1].split('/')[-1] for fn in std_out.decode().splitlines()[1:]]
list_of_file_names_with_full_address = [fn.split()[-1] for fn in std_out.decode().splitlines()[1:]]
I was able to solve this by using a Pig script to read the data from path A, convert it to Parquet (which is the desired storage format anyway) and write it to path B. The process took close to 20 minutes on average for 500 GB files. Thank you for the suggestions.
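For reference, a minimal sketch of what such a Pig script could look like (the paths, the schema, the delimiter and the use of the Parquet storer from the parquet-pig bundle are assumptions for illustration, not the original poster's exact script):
REGISTER parquet-pig-bundle.jar;  -- assumption: the Parquet Pig storer jar is available on the classpath
raw = LOAD '/path/A' USING PigStorage('\t') AS (id:int, payload:chararray);  -- assumed schema and delimiter
STORE raw INTO '/path/B' USING org.apache.parquet.pig.ParquetStorer();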

sqlcmd runs fine at the cmd line but will not produce output as a CmdExec step in a SQL Agent job

I am pulling my hair out and have tried everything the postings suggest.
I have a T-SQL script called C:\DBAReports\testsql.sql. If I go to a command prompt on my server, logged in, and run: sqlcmd -S localhost -i C:\DBAReports\testsql.sql -o C:\DBAReports\testout.txt it writes the output file as expected.
But if I create a new Agent job with one step of type Operating system (CmdExec), set to run as the SQL Server Agent Service Account, with "On Success: Quit the job reporting success" and "On Failure: Quit the job reporting failure", and with the owner set to the same admin Windows login I use at the cmd prompt, then right-click the Agent job and start at step 1, the job succeeds (Job was invoked by my Windows login) and step 1 shows Executed as user is-sql: "The step did not generate any output. Process Exit Code 0. The step was successful".
But it doesn't write the output file.
Any ideas?
The reason I want to do this is that I am getting periodic Error: 18057, Severity: 20, State: 2 (Failed to set up execution context) in my SQL Server log. What I hope to do is kick off this job when this occurs, to try to find out what the SPIDs, statuses, running SQL, etc. are, and write them to an output file.
My testsql.sql script contains:
SELECT
SPID = er.session_id
,STATUS = ses.STATUS
,[Login] = ses.login_name
,Host = ses.host_name
,BlkBy = er.blocking_session_id
,DBName = DB_Name(er.database_id)
,CommandType = er.command
,SQLStatement = st.text
,ObjectName = OBJECT_NAME(st.objectid)
,ElapsedMS = er.total_elapsed_time
,CPUTime = er.cpu_time
,IOReads = er.logical_reads + er.reads
,IOWrites = er.writes
,LastWaitType = er.last_wait_type
,StartTime = er.start_time
,Protocol = con.net_transport
,ConnectionWrites = con.num_writes
,ConnectionReads = con.num_reads
,ClientAddress = con.client_net_address
,Authentication = con.auth_scheme
FROM sys.dm_exec_requests er
OUTER APPLY sys.dm_exec_sql_text(er.sql_handle) st
LEFT JOIN sys.dm_exec_sessions ses
ON ses.session_id = er.session_id
LEFT JOIN sys.dm_exec_connections con
ON con.session_id = ses.session_id
Thanks in advance for any help. I have tried so many suggestions: I either get syntax errors on the command, or, when I don't get any syntax error on the sqlcmd, it just generates no output.
An alternate way: try modifying the job step to Type: Transact-SQL script (T-SQL) with the command:
EXEC master..xp_CMDShell 'C:\Work\temp\test3.bat'
Please replace the bat file path with yours; the bat file can simply contain the sqlcmd command from the question.

How to load the data without text qualifiers using PIG/HIVE/HBase?

I have one CSV file which contains text-qualified (" ") data. I want to load the data into HDFS using Pig/Hive/HBase without the text qualifiers. Please help.
My file input.csv:
"Id","Name"
"1","Raju"
"2","Anitha"
"3","Rakesh"
I want output like:
Id,Name
1,Raju
2,Anitha
3,Rakesh
Try this in a Pig script.
Suppose your input file name is input.csv.
1. First move this input file to HDFS using the copyFromLocal command (e.g. hadoop fs -copyFromLocal input.csv /user/test/).
2. Run the Pig script below.
PigScript:
HDFS mode:
A = LOAD 'hdfs://<hostname>:<port>/user/test/input.csv' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'"(.*)","(.*)"')) AS (id:int,name:chararray);
STORE B INTO '/user/test/output' USING PigStorage(',');
Local mode:
A = LOAD 'input.csv' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'"(.*)","(.*)"')) AS (id:int,name:chararray);
STORE B INTO 'output' USING PigStorage(',');
Output:
Id,Name
1,Raju
2,Anitha
3,Rakesh

Cannot compute MAX

Setup data
mkdir data
echo -e "1\n2\n3\n4\n8\n4\n3\n6" > data/data.txt
Launch Pig in local mode
pig -x local
Script
a = load 'data' Using PigStorage() As (value:int);
b = foreach a generate MAX(value);
dump b;
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.MAX as multiple or none of them fit. Please use an explicit cast.
Just found the answer: it just takes a GROUP ALL before calling the function, since MAX expects a bag of values rather than a single field. Kind of feel the error message could be a little clearer...
a = load 'data' Using PigStorage() As (value:int);
b = GROUP a ALL;
c = foreach b generate MAX(a.value);
dump c;
> 8

Magento stock update with csv

I am using the following script
http://www.sonassi.com/knowledge-base/magento-kb/mass-update-stock-levels-in-magento-fast/
It works beautifully with the test CSV file.
My POS creates a CSV file, but it puts different headings in it, so the script does not work. I want to automate the process. Is there any way to change the names of the headers automatically?
The script requires the headers to be
"sku","qty"
my CSV is
"ITEM","STOCK"
Is there any way for these two different names to be linked within the script so that my script sees ITEM as sku and STOCK as qty?
You should create a PHP script that takes yourfilename.csv (the unformatted file) as input.
$file = file_get_contents('yourfilename.csv');
$file = str_replace('ITEM', 'sku', $file);
$file = str_replace('STOCK', 'qty', $file);
file_put_contents('yourfilename.csv', $file);
The below links are for your reference.
find and replace values in a flat-file using PHP
http://forums.phpfreaks.com/index.php?topic=327900.0
Hope it helps.
Cheers
PHP isn't usually the best way to go for file manipulation, given that you have SSH access.
You could also run the following commands (if you have perl installed, which it is by default in most setups):
perl -pi -e 's/ITEM/sku/g' /path/to/your/csvfile.csv
perl -pi -e 's/STOCK/qty/g' /path/to/your/csvfile.csv
If you want to update the qty using raw SQL, you can create a function like the one below:
function _updateStocks($data){
    $connection     = _getConnection('core_write');
    $sku            = $data[0];
    $newQty         = $data[1];
    $productId      = _getIdFromSku($sku);
    $attributeId    = _getAttributeId();
 
    $sql            = "UPDATE " . _getTableName('cataloginventory_stock_item') . " csi,
                       " . _getTableName('cataloginventory_stock_status') . " css
                       SET
                       csi.qty = ?,
                       csi.is_in_stock = ?,
                       css.qty = ?,
                       css.stock_status = ?
                       WHERE
                       csi.product_id = ?
                       AND csi.product_id = css.product_id";
    $isInStock      = $newQty > 0 ? 1 : 0;
    $stockStatus    = $newQty > 0 ? 1 : 0;
    $connection->query($sql, array($newQty, $isInStock, $newQty, $stockStatus, $productId));
}
And call the above function, passing each CSV row's data as the argument. This is just a hint.
In order to get full working code with details you can refer to the following blog article:
Updating product qty in Magento in an easier & faster way
Hope this helps!

Resources