I tried to import my data in CSV format, but it is taking forever and I cannot do anything except wait.
The data has 1,705 rows and 502 columns; all variables except the target and date are numeric. The file is only 12 MB.
I do not know how many hours I will have to wait to import the data.
Please advise what I can do to try this product on my data.
Fortunately, the problem is resolved.
If the data contains Inf values, importing takes a very long time. I replaced the Inf values with a finite value, and the import finished quickly :)
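For anyone hitting the same thing, a minimal sketch of the workaround in pandas, assuming the CSV is cleaned before import (the file names and the replacement value are placeholders):

import numpy as np
import pandas as pd

# load the original CSV (placeholder file name)
df = pd.read_csv('data.csv')

# replace +Inf/-Inf with a finite sentinel value of your choosing
df = df.replace([np.inf, -np.inf], 1e9)

# write a cleaned copy and import that instead
df.to_csv('data_clean.csv', index=False)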
Related
I am looking to import CSV file data into a temporary table, or any other way of handling CSV data import, in API platform.
I know that in MySQL we can import a CSV at high speed into a temporary table with the LOAD DATA INFILE statement, then validate and store the data.
Any help would be appreciated!
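In case it helps, here is a rough sketch of the MySQL route driven from Python, assuming the mysql-connector-python driver, local_infile enabled on the server, and placeholder connection details and table names:

import mysql.connector

# connection details are placeholders
conn = mysql.connector.connect(
    host='localhost', user='app', password='secret',
    database='mydb', allow_local_infile=True,
)
cur = conn.cursor()

# stage the rows in a temporary table, validate, then copy into the real table
cur.execute('CREATE TEMPORARY TABLE staging LIKE target_table')
cur.execute(
    "LOAD DATA LOCAL INFILE 'data.csv' INTO TABLE staging "
    "FIELDS TERMINATED BY ',' IGNORE 1 LINES"
)
conn.commit()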
Context
I'm trying to write a DataFrame to .csv using PySpark. In other posts I've seen users question this, but I need a .csv for business requirements.
What I've Tried
Almost everything. I've tried .repartition(), and I've tried increasing driver memory to 1 TB. I also tried caching my data first and then writing to csv (which is why the screenshots below show me trying to cache rather than write out to csv). Nothing seems to work.
What Happens
The UI does not show that any tasks fail. The job, whether it's writing to csv or caching first, gets close to completion and then just hangs.
Screenshots
Then, if I drill down into the job...
And if I drill down further
Finally, here are my settings:
You don't need to cache the dataframe: caching only helps when multiple actions are performed on it, and if the count isn't required I would suggest removing it as well.
Now, while saving the dataframe, make sure all the executors are being used.
If your dataframe is around 50 GB, make sure you are not creating lots of small files, as that will degrade performance.
You can repartition the data before saving: if your dataframe has a column that divides it evenly, repartition on that column, or otherwise find an optimal number of partitions.
# repartition into 10 partitions by a column that splits the data evenly (output path is a placeholder)
df.repartition(10, 'col').write.csv('/path/to/output')
Or
# you have 32 executors with 12 cores each (384 task slots), so repartition accordingly
df.repartition(300).write.csv('/path/to/output')
As you are using Databricks, can you try using the Databricks spark-csv package and let us know?
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
# read the CSV with the spark-csv data source
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('file.csv')
# write the processed dataframe back out with the same data source
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')
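As a side note, on Spark 2.x and later the CSV source is built in, so something like df.write.option('header', 'true').csv('file_after_processing') should work without the external package.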
I have created a Sqoop job to import data from Netezza. It imports data incrementally on a daily basis by comparing a timestamp column (the check column) in the source. I am observing that the job imports more records each day than the source table in Netezza shows.
There seems to be no problem or error with the job, and the 'incremental.last.value' is updated properly after each run.
How can I find out what is wrong with the job? I am using Sqoop version 1.4.5.2.2.6.0-2800.
Can you please show the sqoop job statement used? Have you used a split-by column in the sqoop job? If yes, try using a different split-by column.
More investigation showed the job is working correctly. The problem was with my verification method. I was trying to validate the number of rows for a given date in both Netezza and Hive, but the date value of the check column gets updated in Netezza, and those updates are not reflected in Hive in any way. Hence, the number of records for a given day does not stay constant on the Netezza side.
The problem was a good lesson to first check all the conditions of the scenario under consideration; there may be many factors involved in the output beyond the correctness of the code.
I'm inserting a lot of data, e.g. 1 million documents. How should I insert them? In small tests I get different timings when inserting the data in batches (bulk) of 500 versus 1,000 documents; in my use case 500 is faster. Which batch size should I use? Any suggestions?
For batch inserts like the one you are talking about, it would be better to use the appropriately named mongoimport command-line tool.
The mongoimport tool provides a route to import content from a JSON, CSV, or TSV export created by mongoexport, or potentially, another third-party export tool...
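If you need to stay inside the driver instead, a rough pymongo sketch of batched inserts, assuming a local server and placeholder database/collection names (the batch size is the knob you are benchmarking):

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
coll = client['mydb']['mycoll']

BATCH_SIZE = 500  # the size your tests showed was fastest

def insert_in_batches(docs, batch_size=BATCH_SIZE):
    # send the documents to the server in fixed-size unordered batches
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            coll.insert_many(batch, ordered=False)
            batch = []
    if batch:
        coll.insert_many(batch, ordered=False)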
We are buying third party survey data. They are providing us data in SAS format.
Source data format - SAS
Frequency - Daily
Data - Full one year data set (no delta)
We would like to bring this data into our Hadoop environment on a daily basis. What are our options?
We asked them to send the data in a text file, but their text file had 8,650 columns (for example, Country: they had 250 columns, one for each country). Our ETL tool failed to process that many columns. According to them, it is much easier to read the data in SAS format.
Any suggestions?
Thx
The problem here is not a technology problem... It sounds like they are just being unhelpful. I do most of my work in SAS and would never provide someone with a table with that many columns and expect them to import it.
Even if they sent it in SAS format, the SAS dataset is still going to have the same number of columns and the ETL tool (even if it could read in SAS datasets - which is unlikely) is still likely to fail.
Tell them to transpose the data in SAS so that there are fewer columns and then to re-send it as a text file.
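If reshaping on their side is not an option, here is a rough pandas alternative for melting the wide layout into a long one before loading it into Hadoop; the file name, key columns, and the column-name pattern are all placeholders:

import pandas as pd

# read the SAS dataset (sas7bdat); the file name is a placeholder
wide = pd.read_sas('survey.sas7bdat', format='sas7bdat')

# melt the ~250 per-country columns into two columns: country and value
id_cols = ['respondent_id', 'survey_date']  # placeholder key columns
country_cols = [c for c in wide.columns if c.startswith('Country_')]  # placeholder pattern
long_df = wide.melt(id_vars=id_cols, value_vars=country_cols,
                    var_name='country', value_name='response')

# write a narrow delimited file that the ETL tool can handle
long_df.to_csv('survey_long.txt', sep='|', index=False)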
Thanks Everyone..
I think this would solve my issue:
http://www.ats.ucla.edu/stat/sas/modules/tolong.htm