I am trying to import a big xlsx file into Mathematica 9, but I always get the error
Import::nojmem: There was insufficient Java heap space for the
operation. Try increasing the Java Virtual Machine heap size.
Increasing heap size isn't helping.
I also tried saving my file as CSV, but the output is one long list of values. How can I import my file correctly, or how can I convert this list to matrix form? Thanks.
Related requirements:
I want to merge records with MergeRecord, using the bin size as the constraint, and then use PutParquet to write them to HDFS with Snappy compression, so that the size of the written Parquet file matches the bin size.
My approach:
Split
After reading the existing data, I use SplitRecord to break the records into flowfiles of 10 records each, to keep the merge units small. The processor writes with an AvroRecordSetWriter in Snappy-compressed format.
Merge
I use MergeRecord to merge the split flowfiles, with an AvroReader and an AvroRecordSetWriter. I selected the bin-packing algorithm and set the maximum number of bins to 1, the minimum number of records to 1, the maximum number of records to 100000000, the maximum bin size to 1024 MB, the minimum bin size to 1000 MB, and the max bin age to 1 day.
Given the actual data and the bin-packing rules, the data should be merged automatically once a bin reaches 1024 MB; if 1024 MB is never reached, it should still be merged automatically after a day.
Write to HDFS
I use PutParquet with an AvroReader, Snappy compression, and a row group size of 1 GB.
Actual result:
During the merge process, each merged flowfile was 1 GB, but the Parquet file written to HDFS was only around 950 MB, not 1000-1024 MB.
Question:
What rules does MergeRecord use to calculate bin sizes? I initially assumed it measures the size of the data written by the AvroRecordSetWriter, but that doesn't seem to be the case. Can someone explain?
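As a side illustration (a minimal Python sketch using pyarrow and fastavro with made-up records; this is not NiFi's internal logic), the same records generally take a different number of bytes as Avro than as Snappy-compressed Parquet, which would be consistent with a roughly 1 GB merged flowfile turning into a ~950 MB Parquet file:
# Hypothetical comparison of Avro vs. Snappy-compressed Parquet sizes for the
# same records. Purely an illustration; record layout and counts are made up.
import io
import os
import random

import pyarrow as pa
import pyarrow.parquet as pq
from fastavro import parse_schema, writer

schema = parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "value", "type": "double"},
    ],
})
records = [{"id": i, "value": random.random()} for i in range(100_000)]

# Size of the records serialized as Avro (the Snappy codec needs python-snappy).
avro_buf = io.BytesIO()
writer(avro_buf, schema, records, codec="snappy")
avro_size = avro_buf.tell()

# Size of the same records written as Snappy-compressed Parquet.
table = pa.table({
    "id": [r["id"] for r in records],
    "value": [r["value"] for r in records],
})
pq.write_table(table, "rows.parquet", compression="snappy")
parquet_size = os.path.getsize("rows.parquet")

print(f"Avro: {avro_size} bytes, Parquet: {parquet_size} bytes")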
I have a Parquet file that is 800K rows x 8.7K columns. I loaded it into a Dask dataframe:
import dask.dataframe as dd
dask_train_df = dd.read_parquet('train.parquet')
dask_train_df.info()
This yields:
<class 'dask.dataframe.core.DataFrame'>
Columns: 8712 entries, 0 to 8711
dtypes: int8(8712)
When I try to do simple operations like dask_train_df.head() or dask_train_df.loc[2:4].compute(), I get memory errors, even with 17+ GB of RAM.
However, if I do:
import pandas as pd
train = pd.read_parquet('../input/train.parquet')
train.info()
it yields:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Columns: 8712 entries, 0 to 8711
dtypes: int8(8712)
memory usage: 6.5 GB
and I can run train.head() and train.loc[2:4] with no problems, since everything is already in memory.
1) So my question is: why do these simple operations blow up the memory usage with a Dask DataFrame, but work fine when I load everything into memory with a Pandas DataFrame?
I notice that npartitions=1, and I see in the documentation that read_parquet "reads a directory of Parquet data into a Dask.dataframe, one file per partition". In my case, it sounds like I'm losing out on all of the parallelization power of having multiple partitions, but then shouldn't the Dask DataFrame's memory usage be capped at that of the single Pandas DataFrame?
2) Also, a side question: if I wanted to parallelize this single Parquet file by partitioning it in a Dask DataFrame, how would I do so? I don't see a blocksize parameter in the dd.read_parquet signature. I also tried using the repartition function, but I believe that partitions along the rows, and for a Parquet file I would want to partition along the columns?
First, I would like to comment that 8712 columns is rather a lot, and you will find that parsing the schema/metadata may take significant time, never mind the data loading.
When fastparquet loads data, it first allocates a dataframe of sufficient size, then iterates through the columns/chunks (with appropriate overheads, which apparently are small in this case) and assigns values into the allocated dataframe.
When you run a calculation through Dask (any calculation), there can in many cases be intra-task copies in memory of the input variables and other intermediate objects. That is usually not an issue, as the whole data-set should be split into many parts, and the small intermediates' memory overhead is a price worth paying for being able to handle datasets larger than memory. I am not sure at which point you are getting a copy; it may be worth investigating and preventing.
In your case, the whole data-set is a single partition. This will result in a single load task, running in one thread. You will not be getting any parallelism, and any intermediate internal copies apply to the whole dataset. You could load only part of the data by selecting columns, and so manufacture partitions and achieve parallelism that way. However, the typical way to handle parquet data is to make use of "row-group" partitions (i.e., along the index) and multiple files, so the real way to avoid the problem is to use data which is already appropriately partitioned.
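For example (a minimal sketch against the same train.parquet; the column names are just a guess at this file's layout), selecting columns at read time keeps each load task small:
import dask.dataframe as dd

# Load only a subset of the 8712 columns so far less data is materialized
# per read. '0'..'99' are assumed column names for illustration.
cols = [str(i) for i in range(100)]
subset = dd.read_parquet('train.parquet', columns=cols)
print(subset.head())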
Note that since you can load the data directly with fastparquet/pandas, you could probably also save a partitioned version either with the to_parquet method or fastparquet's write function.
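A minimal sketch of that idea (the output path and partition count are arbitrary; this assumes the data fits in memory once, which it does here at ~6.5 GB):
import pandas as pd
import dask.dataframe as dd

# Load once with pandas, then write a row-partitioned copy to a directory.
train = pd.read_parquet('train.parquet')

# Split along the index into 16 partitions; reading the resulting directory
# back with dd.read_parquet gives 16 smaller, parallel load tasks.
ddf = dd.from_pandas(train, npartitions=16)
ddf.to_parquet('train_partitioned/', engine='fastparquet')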
In order to reduce the number of blocks allocated by the NameNode, I'm trying to concatenate some small files into 128 MB files. The small files are in gz format, and the 128 MB files must be in gz format too.
To accomplish this, I take the total size of all the small files, divide that size (in MB) by 128, and use the result as the number of output files I need.
Then I perform a rdd.repartition(nbFiles).saveAsTextFile(PATH,classOf[GzipCodec])
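In PySpark terms, the same approach looks roughly like this (paths and the file count are placeholders, and sc is assumed to be the SparkContext):
# Rough PySpark equivalent of the approach described above.
nb_files = 42  # total input size in MB divided by 128, computed beforehand

rdd = sc.textFile("hdfs:///data/small-gz-files/")
rdd.repartition(nb_files).saveAsTextFile(
    "hdfs:///data/merged-gz-files/",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)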
The problem is that my output directory size is larger than my input directory size (about 10% larger). I tested with both the default and the best compression levels, and I always get a larger output size.
I have no idea why my output directory ends up larger than my input directory, but I imagine it's linked to the fact that I'm repartitioning all the files in the input directory.
Can someone help me understand why I'm getting this result?
Thanks :)
The level of compression will depend on the data distribution. When you call rdd.repartition(nbFiles) you randomly shuffle all the data, so if there was some structure in the input that reduced entropy and enabled better compression, it will be lost.
You can try another approach, like coalesce without a shuffle, or sorting, to see if you get a better result.
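A rough PySpark sketch of those two alternatives (paths, nb_files, and the sort key are placeholders):
rdd = sc.textFile("hdfs:///data/small-gz-files/")
nb_files = 42  # as computed from the total input size

# 1) coalesce avoids a full shuffle, so any local structure in the input that
#    helped compression is largely preserved.
rdd.coalesce(nb_files).saveAsTextFile(
    "hdfs:///data/merged-coalesce/",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)

# 2) Sorting groups similar lines together, which lowers entropy within each
#    output file and can improve the compression ratio.
rdd.sortBy(lambda line: line, numPartitions=nb_files).saveAsTextFile(
    "hdfs:///data/merged-sorted/",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)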
Currently, I'm working on image classification with Spark. I need to read all the images into memory as an RDD, and my method is as follows:
val images = spark.wholeTextFiles("hdfs://imag-dir/")
imag-dir is the directory storing the images on HDFS. With this method, all the images are loaded into memory, and every image is organized as an "image name, image content" pair. However, I find this process time consuming; is there a better way to load a large image data set into Spark?
I suspect that may be because you have a lot of small files on HDFS, which is a problem in itself (the 'small files problem'). Here you'll find a few suggestions for addressing the issue.
You may also want to set the number of partitions (the minPartitions argument of wholeTextFiles) to a reasonable number: at least 2x the number of cores in your cluster (see there for details).
But in sum, apart from the two ideas above, the way you're loading the images is fine and not where your problem lies (assuming spark is your SparkContext).
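For example, in PySpark (a minimal sketch; 16 is a placeholder for roughly 2x the cores of a hypothetical 8-core cluster, and sc is the SparkContext):
# Ask wholeTextFiles for a minimum number of partitions up front.
images = sc.wholeTextFiles("hdfs://imag-dir/", minPartitions=16)

# Each element is an (image name, image content) pair, as in the question.
print(images.getNumPartitions())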
How do I get the memory size of a particular record (tuple) in Apache Pig? Is there any function that helps with that?
Yes,
You can try to use the builtin UDF SIZE.
http://pig.apache.org/docs/r0.11.1/api/org/apache/pig/builtin/SIZE.html
So if you want to compute the size of the entire tuple, you will probably need to first cast all of the tuple fields to bytearray, then use SIZE on each of them, and finally sum them all together to get the tuple size in bytes.
Obviously, you can then convert it to KB.