Load on demand for dynamic report - performance

I have a huge CSV file (about 2 GB) from which I would like to generate a dynamic report. Is there any way to load only the first few MB of data into the report, and then, as the user scrolls to the next page, load the next few MB, and so on? The data to be visualized is huge, so I want good performance and to avoid crashes.

Our JRCsvDataSource implementation (assuming this is the one you want to use) does not consume memory while reading CSV data from the file or input stream; it does not hold on to any values and just parses data row by row. But if the amount of data is huge, the report output itself will be huge, in which case you need to use a report virtualizer during report filling and then during viewing/exporting.
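The constant-memory, row-by-row parsing described above can be illustrated with a minimal Python sketch. This is not JasperReports code; `iter_records` is a hypothetical stand-in showing the generator pattern the data source uses, where each row is parsed, handed off, and discarded.

```python
import csv
import io

def iter_records(fileobj):
    """Yield CSV rows one at a time; nothing is accumulated in memory."""
    reader = csv.reader(fileobj)
    header = next(reader)
    for row in reader:
        yield dict(zip(header, row))

# Simulated CSV source (stands in for a multi-GB file on disk).
data = io.StringIO("id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

total = 0
for record in iter_records(data):
    total += int(record["value"])  # process each row, then let it go

print(total)
```

Because the loop never stores more than one row, memory use is independent of file size; it is the rendered report output, not the input parsing, that needs a virtualizer.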

Related

Incremental data mapping

How can we efficiently load data from an incrementally growing CSV file without repeatedly reading the whole file?
I have used the timestamp information in the file to load the data every 5 minutes, but how can we make this work when no timestamp information is available?
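When no timestamp column is available, one common alternative is to remember the byte offset of the last read and seek past it on the next poll, so each run reads only the newly appended rows. A minimal sketch, assuming the file is append-only (`read_new_rows` and the demo file name are illustrative, not from the original question):

```python
import os

def read_new_rows(path, offset):
    """Read only rows appended since the last call; return (rows, new_offset)."""
    with open(path, "r") as f:
        f.seek(offset)           # skip everything already processed
        chunk = f.read()
        new_offset = f.tell()    # remember where to resume next time
    rows = [line for line in chunk.splitlines() if line]
    return rows, new_offset

# Demo with a growing file.
path = "incremental_demo.csv"
with open(path, "w") as f:
    f.write("a,1\nb,2\n")

rows1, off = read_new_rows(path, 0)     # first poll reads everything
with open(path, "a") as f:
    f.write("c,3\n")                    # file grows between polls
rows2, off = read_new_rows(path, off)   # second poll reads only the new row

print(rows1, rows2)
os.remove(path)
```

The offset would be persisted between runs (a small state file, for instance); if the file can be rotated or truncated, the reader should also check that the stored offset does not exceed the current file size.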

Partitioned parquet file takes more space and more time to query

Theoretically, a Parquet file is expected to take less space than CSV and to return query results faster. My experiment shows the opposite.
I am converting the CSV file at … to a Parquet file partitioned on the "city" field.
The conversion takes 7 minutes.
The Parquet folder is 48 MB, while the CSV is 2.5 MB.
Querying the Parquet file with a filter on "city" takes 350 ms, while the same query on the CSV takes 111 ms.
The code is here https://github.com/yashgt/Samples/blob/master/Parquet.ipynb
The executed notebook in PDF form is here https://github.com/yashgt/Samples/raw/master/parquet.pdf
What am I doing wrong?
You should run this test on a much larger dataset to see the expected results. Parquet is columnar storage designed for big-data analytics: it carries a lot of metadata, and at this dataset size that overhead can outweigh the content itself, so you get no benefit from the fact that you select only a few columns (or even all of them) compared with CSV.
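The size inversion the answer describes can be shown with a toy stdlib-only model. Here `PER_FILE_OVERHEAD` is an illustrative assumption standing in for Parquet's fixed per-file cost (schema, footer, statistics); partitioning a tiny dataset into many files multiplies that fixed cost until it dwarfs the actual data:

```python
import csv
import io
from collections import defaultdict

# Toy dataset: 50 rows spread across 5 "cities".
rows = [("c%02d" % (i % 5), i) for i in range(50)]

# Size of the data as a single CSV.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_size = len(buf.getvalue())

# Partition by city; each partition file pays a fixed metadata cost
# (an illustrative stand-in for Parquet's schema/footer/statistics).
PER_FILE_OVERHEAD = 300  # bytes, assumed for the sketch
partitions = defaultdict(list)
for city, value in rows:
    partitions[city].append((city, value))

partitioned_size = 0
for part in partitions.values():
    b = io.StringIO()
    csv.writer(b).writerows(part)
    partitioned_size += len(b.getvalue()) + PER_FILE_OVERHEAD

print(csv_size, partitioned_size)
```

With millions of rows per partition the fixed overhead amortizes to nearly nothing and Parquet's compression and column pruning win; with 2.5 MB of data split five ways, the overhead dominates, which matches the 48 MB result in the question.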

Page level skip/read in apache parquet

Question: Does Parquet have the ability to skip/read certain pages in a column chunk based on the query we run?
Can page header metadata help here?
http://parquet.apache.org/documentation/latest/
Under "File Format", I read the following statement, which made me doubtful:
Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially.
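Whether or not a given reader implements it, the mechanism in question (skipping pages whose statistics prove they cannot match a predicate) can be sketched as follows. The page layout here is a simplified model, not the actual Parquet encoding:

```python
# Each "page" holds values plus min/max statistics, mimicking the
# per-page statistics a reader could consult before decoding a page.
pages = [
    {"min": 0,  "max": 9,  "values": list(range(0, 10))},
    {"min": 10, "max": 19, "values": list(range(10, 20))},
    {"min": 20, "max": 29, "values": list(range(20, 30))},
]

def query_gt(pages, threshold):
    """Return values > threshold, decoding only pages that might match."""
    pages_read = 0
    out = []
    for page in pages:
        if page["max"] <= threshold:
            continue               # statistics prove no match: skip the page
        pages_read += 1            # otherwise decode the page
        out.extend(v for v in page["values"] if v > threshold)
    return out, pages_read

result, pages_read = query_gt(pages, 21)
print(len(result), pages_read)
```

Only one of the three pages is decoded for this query. Note that the sequential-read advice in the quoted passage is about I/O efficiency within the column chunks you do need; it does not forbid skipping pages that statistics rule out.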

IBM BigSheets Issue

I am getting errors when loading my files into BigSheets, both directly from HDFS (files that are the output of Pig scripts) and from raw data on the local hard disk.
I have observed that when I load the files and issue a row count to check whether all the data made it into BigSheets, fewer rows are reported than expected.
I have checked that the files are consistent and properly delimited (tab- or comma-separated fields).
My file is around 2 GB, in either *.csv or *.tsv format.
In some cases, when I load a file directly from Windows, it sometimes loads successfully with the row count matching the actual number of lines in the data, and sometimes with a lower row count.
Sometimes a fresh file loaded for the first time gives the correct result, but repeating the same operation leaves some rows missing.
Please share your experience with BigSheets and any solutions to problems where the entire data set is not loaded. Thanks in advance.
The data that you originally load into BigSheets is only a subset. You have to run the sheet to apply it to the full dataset.
http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/t0057547.html?lang=en

How to solve Heap Space Error in Talend Enterprise Big Data

I am using the tFileInputJson and tMongoDBOutput components to store JSON data into a MongoDB Database.
When trying this with a small amount of data (nearly 100k JSON objects), the data can be stored into the database without any problems.
Now my requirement is to store nearly 300k JSON objects, and my JSON objects look like:
{
  "LocationId": "253b95ec-c29a-430a-a0c3-614ffb059628",
  "Sdid": "00DlBlqHulDp/43W3eyMUg",
  "StartTime": "2014-03-18 22:22:56.32",
  "EndTime": "2014-03-18 22:22:56.32",
  "RegionId": "10d4bb4c-69dc-4522-801a-b588050099e4",
  "DeviceCategories": [
    "ffffffff-ffff-ffff-ffff-ffffffffffff",
    "00000000-0000-0000-0000-000000000000"
  ],
  "CheckedIn": false
}
While I am performing this operation I am getting the following Exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
[statistics] disconnected
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuffer.append(StringBuffer.java:237)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toString(Unknown Source)
at samplebigdata.retail_store2_0_1.Retail_Store2.tFileInputJSON_1Process(Retail_Store2.java:1773)
at samplebigdata.retail_store2_0_1.Retail_Store2.runJobInTOS(Retail_Store2.java:2469)
at samplebigdata.retail_store2_0_1.Retail_Store2.main(Retail_Store2.java:2328)
Job Retail_Store2 ended at 15:14 10/11/2014. [exit code=1]
My current job looks like:
How can I store so much data into the database in a single job?
The issue here is that you're printing the JSON object to the console (with your tLogRow). This requires all of the JSON objects to be held in memory before finally being dumped all at once to the console once the "flow" is completed.
If you remove the tLogRow components then (in a job as simple as this) Talend should only hold whatever the batch size is for your tMongoDbOutput component in memory and keep pushing batches into the MongoDB.
As an example, here's a screenshot of me successfully loading 100000000 rows of randomly generated data into a MySQL database:
The data set represents about 2.5 GB on disk as a CSV, but was comfortably handled in memory with a max heap of 1 GB: each insert is 100 rows, so the job only really needs to keep 100 rows of the CSV (plus any associated metadata and Talend overheads) in memory at any one point.
In reality, it will probably keep significantly more than that in memory and simply garbage collect the rows that have been inserted into the database when the max memory is close to being reached.
If you have an absolute requirement to log the JSON records that are successfully put into the database, then you might try outputting to a file instead and streaming the output.
As long as you aren't getting too many invalid JSON objects in your tFileInputJson then you can probably keep the reject linked tLogRow as it will only receive the rejected/invalid JSON objects and so shouldn't run out of memory. As you are restricted to small amounts of memory due to being on a 32 bit system you might need to be wary that if the amount of invalid JSON objects grows you will quickly exceed your memory space.
If you simply want to load a large amount of JSON objects into a MongoDB database, then you will probably be best off using the tMongoDBBulkLoad component. This takes a flat file (.csv, .tsv, or .json) and loads it directly into a MongoDB database. The documentation I just linked to shows all the relevant options, but you might be particularly interested in the --jsonArray additional argument that can be passed to the database. There is also a basic example of how to use the component.
This means you cannot do any processing midway through the load and you have to use a pre-prepared JSON/CSV file, but if you just want a quick way to load data into the database using Talend then this should cover it.
If you need to process chunks of the file at a time, then you might want to look at a much more complicated job with a loop: load n records from your input, process them, then restart the processing part of the loop, selecting the next n records (a header of n records, then a header of 2n records, and so on).
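The batching idea behind both suggestions can be sketched independently of Talend. Here `fake_insert` is a hypothetical stand-in for the tMongoDBOutput batch insert; the point is that only one batch of n records is ever held in memory, no matter how many records stream through:

```python
def load_in_batches(records, insert_batch, batch_size=100):
    """Stream records into the database one batch at a time, so at most
    batch_size records are held in memory (mirrors batched output)."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            insert_batch(batch)   # flush a full batch
            batch = []
    if batch:
        insert_batch(batch)       # flush the final partial batch

# Stand-in for the database output: record what arrives and batch sizes.
inserted = []
batch_sizes = []
def fake_insert(batch):
    inserted.extend(batch)
    batch_sizes.append(len(batch))

# A generator, so the 250 records are never all in memory at once.
load_in_batches(({"i": i} for i in range(250)), fake_insert, batch_size=100)
print(len(inserted), batch_sizes)
```

Feeding the loader a generator rather than a list is what keeps the job's footprint at one batch; materializing all records first (as tLogRow effectively does here) is exactly what blows the heap.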
Garpmitzn's answer pretty much covers how to change JVM settings to increase memory space but for something as simple as this you just want to reduce the amount you're keeping in memory for no good reason.
As an aside, if you're paying out for an Enterprise licence of Talend then you should probably be able to get yourself a 64 bit box with 16 gb of RAM easily enough and that will drastically help with your development. I'd at least hope that your production job execution server has a bunch of memory.
I feel you are reading everything into Talend's memory. You have to tune the JVM parameters Xms and Xmx: increase Xmx beyond its current setting, for example from -Xmx2048m to -Xmx4096m.
These parameters are available in the .bat/.sh file of the exported job, or in Talend Studio under the Run Job tab > Advanced settings > JVM Settings.
But it is advisable to design the job so that you do not load too much into memory.
