WebDataRocks file size limitation question

I know that the input data size limit for WebDataRocks is 1 MB. But I don't know whether that limit applies to data passed inline to the WebDataRocks instance in the code, or only to data loaded from local or external JSON or CSV files. Can anybody help me, please?

Actually, the input data size limit for WebDataRocks applies to both inline data and data loaded from external files. In both cases the data size should not exceed 1 MB.
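A quick way to check this in practice is to measure the serialized size of your dataset before handing it to the pivot. Below is a minimal sketch in Python, assuming the 1 MB cap is measured on the serialized JSON (or CSV) payload; the threshold constant and helper function are illustrative, not part of the WebDataRocks API.
    import json

    WDR_LIMIT_BYTES = 1_000_000  # assumed 1 MB limit reported for WebDataRocks

    def fits_webdatarocks_limit(records):
        """Return True if the JSON-serialized rows stay under the assumed 1 MB cap.

        `records` is the same list of row objects you would pass inline via the
        `data` option or serve as an external JSON file.
        """
        payload = json.dumps(records).encode("utf-8")
        return len(payload) <= WDR_LIMIT_BYTES

    # Example: build some rows and check them before embedding inline.
    rows = [{"Country": "US", "Price": i * 1.5} for i in range(10_000)]
    print(fits_webdatarocks_limit(rows))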

Related

Is the stripe size in the ORC file dump represented in compressed format?

We have Snappy-compressed ORC files. I'm just trying to understand the ORC file dump log. I know that by default the stripe size for ORC is 64 MB, but I see that each stripe in the ORC file is generally around 5-10 MB in size. I just want to know whether those sizes are represented in compressed form, or whether my stripes are simply smaller than the 64 MB default.
Note: I'm using the latest EMR instance in the background and the files are in S3.
Stripe size denotes the size of the memory buffer that is used to convert row storage to column storage before writing to HDFS, so the stripes you see in HDFS are always smaller than the configured stripe size (i.e. 64 MB).
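If you want to confirm this from the reader side, you can compare the number of stripes on disk with the in-memory size of each decoded stripe. The sketch below uses pyarrow's ORC reader purely as an illustration (it is not the ORC file dump tool); the file path is a placeholder.
    import pyarrow.orc as orc

    # Placeholder path; point this at one of the Snappy-compressed ORC files.
    reader = orc.ORCFile("part-00000.orc")

    print("stripes in file:", reader.nstripes)
    for i in range(reader.nstripes):
        # read_stripe() decodes one stripe into an in-memory RecordBatch; its
        # nbytes is the decoded, uncompressed size, which is typically much
        # larger than the 5-10 MB the stripe occupies on disk.
        batch = reader.read_stripe(i)
        print(f"stripe {i}: {batch.num_rows} rows, ~{batch.nbytes / 1e6:.1f} MB in memory")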

How do I stream parquet using pyarrow?

I'm trying to read a large dataset of Parquet files piece by piece, do some operation on each piece, and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset, and I'm aware of RecordBatchStreamReader, but I'm not sure how to combine them.
How can I use Pyarrow to do this?
At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files, see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
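Given that constraint, the practical workaround is to stream at file granularity: read one Parquet file, process it, and release it before moving to the next. A minimal sketch of that pattern, assuming the dataset is a directory of .parquet files (the path and the process() function are placeholders):
    import glob
    import pyarrow.parquet as pq

    def process(table):
        # Placeholder for whatever per-piece operation you need.
        print(table.num_rows)

    # One file at a time: only the current file's table is held in memory.
    for path in sorted(glob.glob("dataset_dir/*.parquet")):
        table = pq.read_table(path)
        process(table)
        del table  # drop the reference before reading the next file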

Page level skip/read in apache parquet

Question: Does Parquet have the ability to skip/read certain pages in a column chunk based on the query we run?
Can page header metadata help here?
http://parquet.apache.org/documentation/latest/
Under File Format, I read the following statement, which made me doubt that page-level reads are supported:
Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially.
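For what it's worth, that statement matches the granularity most readers expose: the footer metadata tells you where each column chunk lives, and reads are then issued per column chunk (or per row group), not per page. A small sketch using pyarrow only as an example reader (the file path and column name are placeholders):
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")  # placeholder path

    # Step 1: the footer metadata lists every row group and column chunk,
    # including where each chunk's pages start.
    meta = pf.metadata
    for rg in range(meta.num_row_groups):
        col = meta.row_group(rg).column(0)
        print(rg, col.path_in_schema, col.total_compressed_size, col.data_page_offset)

    # Step 2: reads happen per column chunk within a row group, not per page.
    table = pf.read_row_group(0, columns=["some_column"])  # placeholder column name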

Load on demand for dynamic report

I have a huge CSV file, about 2 GB, for which I would like to generate a dynamic report. Is there any possibility of loading only the first few MB of data into the report, and then, on scrolling to the next page, loading the next few MB, and so on? The data to be visualized is huge, so I want good performance and to avoid crashes.
Our JRCsvDataSource implementation (assuming this is the one you want to use) does not consume memory while reading CSV data from the file or input stream, as it does not hold on to any values and just parses the data row by row. But if the amount of data is huge, then the report output itself will be huge, in which case you need to use a report virtualizer during report filling and then during viewing/exporting.
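The row-by-row behaviour described above is the same pattern you would use in any language when the source file does not fit in memory: iterate the CSV as a stream and keep only the current row. A minimal, purely illustrative sketch of that idea in Python (not JasperReports code; the path is a placeholder):
    import csv

    def iter_rows(path):
        # Yield one CSV row at a time; nothing else is kept in memory.
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield row

    # Example: walk a 2 GB file without ever loading it whole.
    count = 0
    for row in iter_rows("huge_file.csv"):  # placeholder path
        count += 1
    print("rows seen:", count)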

SSRS - Unzip image varbinary(max) data and display

I'm working with a database-driven application that allows users to upload images, which are then zipped and stored in the database as varbinary(max). I am now trying to get those images to display within an SSRS report (using BI 2005).
How can I convert the file data (which is 65,438 characters long when zipped and 65,535 characters when not zipped) into a normal varbinary format that I can then display in SSRS?
Many thanks in advance!
You'll have to embed a reference to a DLL in your project and use a function to decompress the data within SSRS; see for example SharpZipLib. Consider storing the data uncompressed if possible, as the CPU/space trade-off is unlikely to be in your favour here: image data usually has a poor compression ratio because it is already compressed.
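That last point is easy to verify: running a deflate-style compressor over data that is already compressed (as most image formats are) gives little or no size reduction. A quick illustration in Python, using random bytes as a stand-in for already-compressed image data:
    import os
    import zlib

    # Random bytes behave like already-compressed data: there is no redundancy left.
    fake_image = os.urandom(65_438)       # roughly the size mentioned in the question
    compressed = zlib.compress(fake_image)

    print("original :", len(fake_image), "bytes")
    print("deflated :", len(compressed), "bytes")  # typically slightly larger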
