Importing data from SAS to Hadoop

I want to export a SAS dataset from SAS to a Hadoop SDP database (via Impala).
It exports, but it takes almost 10 hours, which I want to reduce. Can anyone suggest how I can speed this up?
I am using this code:
libname sdpuwa impala dsn=xxx pw=xxx database=xxxx;

data sdpuwa.imapal_table;
    set sas_table;
run;

Try the bulkload=yes libname option.
libname sdpuwa impala dsn=xxx pw=xxx database=xxxx bulkload=yes;
data sdpuwa.imapal_table;
set sas_table;
run;

Related

Impossible to delete table in SAS Enterprise Guide

I can't manage to delete my table. I get this error:
"ERROR: File SASUSER.MCO.DATA is not a SAS data set."
I've tried many ways to delete it, but none of them works.
Thanks for your help!
Tested with proc delete, proc sql drop, and %deltable.
I have used the code below:
proc sql; drop table sasuser.MCO; quit;
%deltable (tables=sasuser.MCO)
proc datasets nolist lib=sasuser; delete MCO ; quit;
The log and the output of proc datasets lib=sasuser; run; are attached as screenshots.
From your error message it seems that the file is NOT an actual SAS dataset. I have never seen a SAS dataset on Unix that was only one thousand bytes long; even an empty dataset is normally more like 14K, depending on the default block size that SAS uses to create the files.
So just use the operating system to delete the file. The file should be named mco.sas7bdat and it should be in the directory that the SASUSER libref points to. So if you have the XCMD option active, you could just use code like this:
x "rm %sysfunc(pathname(sasuser))/mco.sas7bdat";
If XCMD is not active then you will need to use the FDELETE() function instead.
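For example, a minimal sketch using FDELETE(), assuming the stray file really is mco.sas7bdat under the SASUSER path:
filename delfile "%sysfunc(pathname(sasuser))/mco.sas7bdat";

data _null_;
    /* FDELETE removes the external file assigned to the fileref; rc=0 means success */
    rc = fdelete('delfile');
    put rc=;
run;

filename delfile clear;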

Sqoop date issue when importing from oracle

I'm trying to import a huge table from Oracle 10g to HDFS (GCS, since I'm using Sqoop with Google Cloud Dataproc) as Avro. Everything works fine when the table doesn't have any date columns, but when it does, some dates are imported very wrong.
Like: Oracle data -> 30/07/76 and HDFS data -> 14976-07-30 20:02:00.0
Like: Oracle data -> 26/03/84 and HDFS data -> 10384-03-26 20:32:34.0
I'm already mapping the date fields as String to bring them in like that. I was previously importing the default Sqoop way, which brings the date fields in as epoch ints, but that conversion was incorrect too.
Like: Oracle data -> 01/01/01 and HDFS data -> -62135769600000 when it should be 978314400000
I hope someone can help me fix this issue.
Thanks.
Additional information:
Sqoop command that I'm running:
import -Dmapreduce.job.user.classpath.first=true -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect=$JDBC_STR --username=$USER --password=$PASS --target-dir=gs://sqoop-dev-out-files/new/$TABLE --num-mappers=10 --fields-terminated-by="\t" --lines-terminated-by="\n" --null-string='null' --null-non-string='null' --table=$SCHEMA.$TABLE --as-avrodatafile --map-column-java="DATACADASTRO=String,DATAINICIAL=String,DATAFINAL=String"
Sqoop version: 1.4.7
JDBC version: 6
I think your date in Oracle is 01/01/0001; try to_char(COLUMN,'DD/MM/YYYY').
My issue is that my date really is 01/01/0001, because of user mistyping, and I can't update the column in the source Oracle database.
My issue is that when converted to Unix time it should come out as -62135596800000, but instead it comes out as -62135769600000 (30/12/0000).
At first I thought it was a timezone issue, but it is a two-day difference.
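If you do end up applying the to_char suggestion on the Oracle side, one way to do it is to switch from --table to a free-form --query import. This is only a sketch: the ID split column and the selected column list are assumptions, so adjust them to your table.
import -Dmapreduce.job.user.classpath.first=true \
  --connect=$JDBC_STR --username=$USER --password=$PASS \
  --target-dir=gs://sqoop-dev-out-files/new/$TABLE \
  --num-mappers=10 --as-avrodatafile \
  --null-string='null' --null-non-string='null' \
  --query "SELECT ID, TO_CHAR(DATACADASTRO, 'YYYY-MM-DD HH24:MI:SS') AS DATACADASTRO FROM $SCHEMA.$TABLE WHERE \$CONDITIONS" \
  --split-by ID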

sparklyr write data to hdfs or hive

I tried using sparklyr to write data to HDFS or Hive, but was unable to find a way. Is it even possible to write an R data frame to HDFS or Hive using sparklyr? Please note that my R and Hadoop are running on two different servers, so I need a way to write to a remote HDFS from R.
Regards
Rahul
Writing a Spark table to Hive using sparklyr:
iris_spark_table <- copy_to(sc, iris, overwrite = TRUE)
sdf_copy_to(sc, iris_spark_table)
DBI::dbGetQuery(sc, "create table iris_hive as SELECT * FROM iris_spark_table")
As of the latest sparklyr you can use spark_write_table. Pass the name in the format database.table_name to specify a database:
iris_spark_table <- copy_to(sc, iris, overwrite = TRUE)
spark_write_table(
    iris_spark_table,
    name = 'my_database.iris_hive',
    mode = 'overwrite'
)
Also see this SO post where I got some input on more options.
You can use sdf_copy_to to copy a data frame into Spark, let's say as tempTable. Then use DBI::dbGetQuery(sc, "INSERT INTO TABLE MyHiveTable SELECT * FROM tempTable") to insert the data frame records into a Hive table.
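Putting that together, a minimal sketch might look like the following; the connection settings are placeholders and MyHiveTable is assumed to already exist in Hive:
library(sparklyr)
library(DBI)

sc <- spark_connect(master = "yarn-client")   # placeholder connection settings

# copy the local R data frame into Spark and register it as tempTable
tempTable <- sdf_copy_to(sc, iris, name = "tempTable", overwrite = TRUE)

# append the rows into the existing Hive table
DBI::dbGetQuery(sc, "INSERT INTO TABLE MyHiveTable SELECT * FROM tempTable")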

How does --direct parameter in Sqoop export work with Vertica?

I got a Too many ROS containers ... error when exporting a large amount of data from HDFS to Vertica. I know there is a DIRECT option for the vsql COPY, which bypasses the WOS and loads data straight into ROS containers. I also noticed the --direct option in Sqoop export, see this Sqoop User Guide. I'm just wondering if these two "direct" options have the same function.
I have tried modifying Vertica configuration parameters like MoveOutInterval, MergeOutInterval... but this didn't help much.
So does anyone know if the direct mode of Sqoop export will help solve the ROS containers issue? Thanks!
--direct is only supported by specific database connectors. Since there isn't one for Vertica, you would be using the generic JDBC one. I really doubt using --direct does anything... but if you really want to test this, you can look at the statements sent in query_requests:
select *
from query_requests
where request_type = 'LOAD'
and start_timestamp > clock_timestamp() - interval '1 hour'
That will show you all load statements within the last hour. The Sqoop statements should get converted to a COPY; I would really hope so, anyhow! If it is a bunch of INSERT ... VALUES statements then I highly suggest NOT using it. If it is not producing a COPY, then you'll need to change the query above to look for the INSERTs:
select *
from query_requests
where request_type = 'QUERY'
and request ilike 'insert%'
and start_timestamp > clock_timestamp() - interval '1 hour'
Let me know what you find here. If it is doing INSERT...VALUES then I can tell you how to fix it (but it is a bit of work).

How can I export contents of an oracle table to a file?

I'm getting ready to clean up some old tables that are no longer in use, but I would like to archive the contents before removing them from the database.
Is it possible to export the contents of a table to a file? Ideally, one file per table.
You can use Oracle's export tool: exp
Edit:
exp name/pwd@dbname file=filename.dmp tables=tablename rows=y indexes=n triggers=n grants=n
You can easily do it using Python and the cx_Oracle module.
The Python script will extract the data to disk in CSV format.
Here’s how you connect to Oracle using Python/cx_Oracle:
import cx_Oracle

constr = 'scott/tiger@localhost:1521/ORCL12'
con = cx_Oracle.connect(constr)
cur = con.cursor()
After fetching the data you can loop through the result set in chunks and save the data in CSV format (chunks() and f_out are defined elsewhere in the script):
for i, chunk in enumerate(chunks(cur)):
    f_out.write('\n'.join(column_delimiter.join(str(col) for col in row) for row in chunk))
    f_out.write('\n')
I used this approach when I wrote TableHunter-For-Oracle
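For reference, a self-contained sketch of that approach might look like this; the connect string, table name, and output path are placeholders, and the chunks() helper simply wraps cursor.fetchmany():
import cx_Oracle

column_delimiter = ','

def chunks(cursor, size=10000):
    # yield batches of rows until the cursor is exhausted
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            break
        yield rows

con = cx_Oracle.connect('scott/tiger@localhost:1521/ORCL12')
cur = con.cursor()
cur.execute('SELECT * FROM some_old_table')   # placeholder table name

with open('some_old_table.csv', 'w') as f_out:
    for chunk in chunks(cur):
        f_out.write('\n'.join(column_delimiter.join(str(col) for col in row) for row in chunk))
        f_out.write('\n')

cur.close()
con.close()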
