Calculate size of SparkR dataframe - sparkr

I have a dataframe I got by making a query using a SQLContext:
> df <- sql(sqlContext, "SELECT * FROM myTable")
When I try to get its size:
> object.size(df)
1024 bytes
I know that it is not the real size of the dataframe, probably because it's distributed over Spark nodes. To get the real size I need to collect it:
> localDf <- collect(df)
> object.size(localDf)
45992 bytes
Sometimes the dataframe is too big to fit in local memory. Is there a simple way to know the actual size of a dataframe without collecting it locally?

One way to do this is to use the Spark Web UI. Under the Executors tab you can look at Storage Memory.
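If you only need a rough figure without the UI, another option is to collect a small sample and extrapolate. A minimal sketch in PySpark (the question uses SparkR, where collect() on a sample() plays the same role; the 1% fraction and the pandas dependency are assumptions here):
sample_pdf = df.sample(withReplacement=False, fraction=0.01).toPandas()
sample_bytes = sample_pdf.memory_usage(deep=True).sum()  # driver-side size of the sample
estimated_bytes = sample_bytes * df.count() / max(len(sample_pdf), 1)
print("approximate size: %d bytes" % estimated_bytes)
Note that this estimates the in-memory size on the driver, not the serialized size Spark reports in its UI.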

I actually found a satisfactory workaround for this problem. I set the following Spark configuration when loading the SparkContext:
spark.driver.maxResultSize=1m
In this case, when the result is bigger than 1 MB, Spark throws an org.apache.spark.SparkException, so I catch it and return an error message.
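For illustration, here is roughly what that guard looks like in PySpark (the question uses SparkR, where tryCatch() around collect() plays the same role; the table name and the 1m limit come from above, the rest is an assumption):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "1m")
         .getOrCreate())
df = spark.sql("SELECT * FROM myTable")
try:
    local_rows = df.collect()
except Exception as e:
    # Spark aborts the collect with an org.apache.spark.SparkException once the
    # serialized result exceeds spark.driver.maxResultSize.
    print("DataFrame is larger than spark.driver.maxResultSize: %s" % e)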

Related

Poor spark performance writing to csv

Context
I'm trying to write a DataFrame to .csv using PySpark. In other posts I've seen users question this approach, but I need a .csv for business requirements.
What I've Tried
Almost everything. I've tried .repartition(), and I've tried increasing driver memory to 1T. I also tried caching my data first and then writing to csv (which is why the screenshots below indicate I'm trying to cache vs. write out to csv). Nothing seems to work.
What Happens
So, the UI does not show that any tasks fail. The job, whether it's writing to csv or caching first, gets close to completion and then just hangs.
Screenshots
(Omitted: the Spark UI job overview, the drill-down into the job and its stages, and my cluster settings.)
You don't need to cache the dataframe, as caching only helps when multiple actions are performed on it; if it isn't required, I would also suggest removing the count.
Now, while saving the dataframe, make sure all the executors are being used.
If your dataframe is around 50 GB, make sure you are not creating many small files, as that will degrade performance.
You can repartition the data before saving: if your dataframe has a column that divides it evenly, repartition on that column; otherwise find an optimum number of partitions.
df.repartition(10, 'col').write.csv('/path/to/output')  # numPartitions comes first, then the partitioning column; the path is a placeholder
Or
# you have 32 executors with 12 cores each, so repartition accordingly
df.repartition(300).write.csv('/path/to/output')
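A quick sanity check before kicking off the expensive write is to confirm the partition count actually changed (the output path and header option here are placeholders):
df = df.repartition(300)
print(df.rdd.getNumPartitions())   # should report 300
df.write.option('header', 'true').mode('overwrite').csv('/path/to/output')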
As you are using Databricks, can you try using the databricks spark-csv package and let us know?
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('file.csv')
df.write.format('com.databricks.spark.csv').save('file_after_processing.csv')
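Also worth noting: on Spark 2.x and later the CSV source is built in, so the external package is not required. A minimal sketch assuming a SparkSession named spark and placeholder paths:
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df.write.csv('file_after_processing', header=True)
Keep in mind that write.csv() produces a directory of part files, one per partition, which ties back to the repartitioning advice above.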

increase efficiency of sqoop export from hdfs

I am trying to export data using Sqoop from files stored in HDFS to Vertica. For around 10k records the files get loaded within a few minutes, but when I try to run tens of millions (crores) of records, it loads around 0.5% within 15 minutes or so. I have tried increasing the number of mappers, but it does nothing to improve efficiency. Even setting the chunk size to increase the number of mappers does not increase the number.
Please help.
Thanks!
As you are using batch export, try increasing the records per transaction and records per statement parameters using the following properties:
sqoop.export.records.per.statement: aggregates multiple rows into one single INSERT statement.
sqoop.export.records.per.transaction: how many INSERT statements will be issued per transaction.
I hope this solves the issue.
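Both properties are passed as generic -D options, placed before the tool-specific arguments of sqoop export. A sketch of the invocation, driven from Python here only to keep the examples in one language (connection string, credentials, table, and export directory are placeholders):
import subprocess

subprocess.run([
    "sqoop", "export",
    "-Dsqoop.export.records.per.statement=1000",   # rows batched into one INSERT
    "-Dsqoop.export.records.per.transaction=100",  # INSERT statements per commit
    "--connect", "jdbc:vertica://vertica-host:5433/mydb",
    "--username", "dbuser",
    "--password-file", "/user/me/sqoop.password",
    "--table", "target_table",
    "--export-dir", "/user/me/export_dir",
    "--batch",
], check=True)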
Most MPP/RDBMS systems have Sqoop connectors to exploit parallelism and increase efficiency when transferring data between HDFS and the MPP/RDBMS. However, it seems Vertica has taken this approach: http://www.vertica.com/2012/07/05/teaching-the-elephant-new-tricks/
https://github.com/vertica/Vertica-Hadoop-Connector
Is this a "wide" dataset? It might be a sqoop bug https://issues.apache.org/jira/browse/SQOOP-2920 if number of columns is very high (in hundreds), sqoop starts choking (very high on cpu). When number of fields is small, it's usually other way around - when sqoop is bored and rdbms systems can't keep up.

Is there an Alternative for HBaseStorage in PIG

I am using HBaseStorage with -caching option in pig script as follows
HBaseStorage('countDetails:ansCount countDetails:divCount countDetails:unansCount countDetails:engCount countDetails:ineffCount countDetails:totalCount', '-caching 1000');
I can see this reflected in my job.xml,
but I see no difference in run time. I am processing 10 million records and storing around 160 MB of data into HBase.
When I store the result in HDFS the job takes 3 minutes; the same job takes 30 minutes to store into HBase.
I even tried by setting
SET hbase.client.scanner.caching 1000;
Please let me know how I can reduce the time.
Is there any alternative for HBaseStorage?
http://apmblog.compuware.com/2013/02/19/speeding-up-a-pighbase-mapreduce-job-by-a-factor-of-15/
The above blog says that I have to set hbase.client.scanner.caching in the bootstrap script,
but I don't know how to do that.
Will it be enough if I set it in the HBase conf?
Please help me out with this.
hbase.client.scanner.caching sets the number of rows that will be fetched when calling next on a scanner if it is not already served from (local, client) memory.
Higher caching values enable faster scanners but eat up more memory, and some calls of next may take longer when the cache is empty. Do not set this value such that the time between invocations is greater than the scanner timeout,
i.e. hbase.regionserver.lease.period. This property is 1 minute by default; clients must
report in within this period or they are considered dead.
In my experience HBase doesn't perform very well with Pig. If you don't have a requirement for random look-ups, then use HDFS only; otherwise an HBase MapReduce job would be the better option. Also, in a Hadoop MR job you can connect to HBase directly (this option gave me the best performance).

Reduce job pending in HFileOutputFormat

I am using
Hbase:0.92.1-cdh4.1.2, and
Hadoop:2.0.0-cdh4.1.2
I have a mapreduce program that will load data from HDFS to HBase using HFileOutputFormat in cluster mode.
In that mapreduce program I'm using HFileOutputFormat.configureIncrementalLoad() to bulk load an 800,000-record
data set of 7.3 GB, and it runs fine, but it does not run for a 900,000-record data set of 8.3 GB.
In the case of the 8.3 GB data my mapreduce program has 133 maps and one reducer; all maps complete successfully, but the reducer status stays Pending for a long time. There is nothing wrong with the cluster, since other jobs run fine and this job also runs fine up to 7.3 GB of data.
What could I be doing wrong?
How do I fix this issue?
I ran into the same problem. Looking at the JobTracker logs, I noticed there was not enough free space for the single reducer to run on any of my nodes:
2013-09-15 16:55:19,385 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_slave01.mydomain.com:localhost/127.0.0.1:43455 has 503,777,017,856 bytes free; but we expect reduce input to take 978136413988
This 503 GB refers to the free space available on one of the hard drives on that particular slave ("tracker_slave01.mydomain.com"); the reducer apparently needs to copy all the data to a single drive.
The reason this happens is your table only has one region when it is brand new. As data is inserted into that region, it'll eventually split on its own.
A solution to this is to pre-create your regions when creating your table. The Bulk Loading Chapter in the HBase book discusses this, and presents two options for doing this. This can also be done via the HBase shell (see create's SPLITS argument I think). The challenge though is defining your splits such that the regions get an even distribution of keys. I've yet to solve this problem perfectly, but here's what I'm doing currently:
// Pre-split the new table into 100 regions spanning the 4-byte int key space,
// so the bulk-load reducers can run in parallel instead of hitting one region.
HTableDescriptor desc = new HTableDescriptor();
desc.setName(Bytes.toBytes(tableName));
desc.addFamily(new HColumnDescriptor("my_col_fam"));
// createTable(descriptor, startKey, endKey, numRegions)
admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);
An alternative solution would be to not use configureIncrementalLoad, and instead: 1) just generate your HFiles via MapReduce with no reducers; 2) use the completebulkload feature in hbase.jar to import your records into HBase. Of course, I think this runs into the same problem with regions, so you'll want to create the regions ahead of time too (I think).
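For reference, the completebulkload step mentioned in (2) is a plain command-line invocation; a sketch driven from Python to keep the examples in one language (the jar path, HFile output directory, and table name are placeholders):
import subprocess

subprocess.run([
    "hadoop", "jar", "/usr/lib/hbase/hbase.jar", "completebulkload",
    "/user/me/hfile_output",   # directory of HFiles produced by the map-only job
    "my_table",                # target HBase table
], check=True)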
Your job is running with a single reducer, which means 7 GB of data is processed by a single task.
The main reason for this is that HFileOutputFormat starts a reducer that sorts and merges the data to be loaded into the HBase table.
Here, number of reducers = number of regions in the HBase table.
Increase the number of regions and you will achieve parallelism in the reducers. :)
You can get more details here:
http://databuzzprd.blogspot.in/2013/11/bulk-load-data-in-hbase-table.html

What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
While trying to make a copy of a partitioned table using the commands in the hive console:
CREATE TABLE copy_table_name LIKE table_name;
INSERT OVERWRITE TABLE copy_table_name PARTITION(day) SELECT * FROM table_name;
I initially got some semantic analysis errors and had to set:
set hive.exec.dynamic.partition=true
set hive.exec.dynamic.partition.mode=nonstrict
Although I'm not sure what the above properties do.
Full output from the hive console:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201206191101_4557, Tracking URL = http://jobtracker:50030/jobdetails.jsp?jobid=job_201206191101_4557
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201206191101_4557
2012-06-25 09:53:05,826 Stage-1 map = 0%, reduce = 0%
2012-06-25 09:53:53,044 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206191101_4557 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
That's not the real error, here's how to find it:
Go to the hadoop jobtracker web-dashboard, find the hive mapreduce jobs that failed and look at the logs of the failed tasks. That will show you the real error.
The console output errors are nearly useless, largely because the console doesn't have a view of the individual jobs/tasks to pull the real errors (there could be errors in multiple tasks).
I know I am 3 years late on this thread, however still providing my 2 cents for similar cases in future.
I recently faced the same issue/error in my cluster.
The job would always get to about 80%+ reduce progress and then fail with the same error, with nothing to go on in the execution logs either.
Upon multiple iterations and research I found that, among the plethora of files getting loaded, some were non-compliant with the structure provided for the base table (the table being used to insert data into the partitioned table).
The point to note here is that whenever I executed a select query for a particular value in the partitioning column, or created a static partition, it worked fine, since in that case the error records were being skipped.
TL;DR: check the incoming data/files for inconsistencies in structure, as Hive follows a schema-on-read philosophy.
Adding some information here, as it took me a while to find the hadoop jobtracker web-dashboard in HDInsight (Azure's Hadoop), and a colleague finally showed me where it was. There is a shortcut on the head node called "Hadoop Yarn Status" which is just a link to a local http page (http://headnodehost:9014/cluster in my case).
In that dashboard you can find your failed application, and then after clicking into it you can look at the logs of the individual map and reduce jobs.
In my case it seemed to still be running out of memory in the reducers, even though I had cranked the memory in the configuration already. For some reason it was not surfacing the "java outofmemory" errors I got earlier though.
The top answer is right that the error code doesn't give you much info. One of the common causes our team saw for this error code was a poorly optimized query. A known cause was an inner join where the left-side table is orders of magnitude bigger than the table on the right side. Swapping these tables would usually do the trick in such cases.
I removed the _SUCCESS file from the EMR output path in S3 and it worked fine.
I was also facing the same error when I was inserting data into a Hive external table that pointed to an Elasticsearch cluster.
I replaced the older JAR elasticsearch-hadoop-2.0.0.RC1.jar with elasticsearch-hadoop-5.6.0.jar, and everything worked fine.
My suggestion is to use the specific JAR matching your Elasticsearch version; don't use older JARs if you are using a newer version of Elasticsearch.
Thanks to this post Hive- Elasticsearch Write Operation #409
I received this error when joining two tables, where one table is large and the other is small enough to fit in memory. In such a case, use
set hive.auto.convert.join = false
This might help get rid of the above error. For more detail on this issue, please refer to the threads below:
Hive Map-Join configuration mystery
Hive.auto.convert.join = true what is the significance of this?
I faced the same issue as well; when I checked the dashboard I found the following error. The data was coming through Flume and had been interrupted in between, which may have made a few files inconsistent.
Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected end-of-input within/between OBJECT entries
Running on fewer files it worked. Format consistency was the reason in my case.
I faced the same issue because I didn't have permission to query the database I was trying to query.
If you don't have permission to query the table/database, besides the Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask error, you will see that Cloudera Manager is not even registering your query.
In my case, the solution was adding more RAM to the virtual machines. Sometimes code 2 means that the map and reduce nodes do not have enough memory.
Another option is to change the properties "mapreduce.map.memory.mb" and "mapreduce.reduce.memory.mb" in the mapred-site.xml file.
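If editing mapred-site.xml isn't convenient, the same properties can also be overridden per job. A sketch that launches the Hive CLI with the overrides, driven from Python to keep the examples in one language (the memory values and the query file are placeholders):
import subprocess

subprocess.run([
    "hive",
    "--hiveconf", "mapreduce.map.memory.mb=4096",
    "--hiveconf", "mapreduce.reduce.memory.mb=8192",
    "-f", "copy_table.hql",   # e.g. the INSERT OVERWRITE query from the question
], check=True)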
I got the same error while creating the Hive table in Beeline, and then tried to create it through spark-shell, which threw the actual error. In my case the error was a disk space quota on the HDFS directory:
org.apache.hadoop.ipc.RemoteException: The DiskSpace quota of /user/hive/warehouse/XXX_XX.db is exceeded: quota = 6597069766656 B = 6 TB but diskspace consumed = 6597493381629 B = 6.00 TB
