I have a training set of 2.4M samples with 100K dimensions.
When I load the data into a SupervisedDataSet using addSample(), it raises a MemoryError after only 130K samples have been loaded.
Is there any way to load really big datasets into a SupervisedDataSet?
Related
How can we efficiently load data from a CSV file that grows incrementally, without re-reading the whole file each time?
I have used the timestamp information in the file to load the new data every 5 minutes, but if timestamp information is not available, how can we make this work?
My organization is thinking about offloading unstructured data, like text and images saved as part of tables in an Oracle database, into Hadoop. The size of the DB is around 10 TB and growing. The size of the CLOB/BLOB columns is around 3 TB. Right now these columns are queried for certain kinds of reports through a web application. They are also written to, but not very frequently.
What kind of approach can we take to achieve proper offloading of the data while ensuring that the offloaded data remains available for reads through the existing web application?
You can find part of the answer on the Oracle blog (link).
If the data needs to be pulled into an HDFS environment via Sqoop, then you should first read the following from the Sqoop documentation.
Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accommodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.
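To make the --inline-lob-limit behaviour concrete, here is a minimal sketch that shells out to the sqoop CLI from Java; the JDBC URL, credentials, table, directories and mapper count are placeholders, and it assumes the sqoop client is installed and on the PATH of the machine running the import:

// Minimal sketch (placeholders throughout): runs a Sqoop import with LOBs forced
// into the _lobs subdirectory via --inline-lob-limit 0.
import java.io.IOException;

public class SqoopLobImportSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB",    // placeholder Oracle JDBC URL
                "--username", "app_user",
                "--password-file", "/user/app_user/.ora.password",         // keeps the password off the command line
                "--table", "DOCUMENT_STORE",                               // hypothetical table with CLOB/BLOB columns
                "--target-dir", "/data/oracle_offload/document_store",
                "--inline-lob-limit", "0",                                 // 0 = always spill LOBs to external storage
                "-m", "4");                                                // number of parallel mappers
        pb.inheritIO();                                                    // stream Sqoop's own logging to this console
        int exitCode = pb.start().waitFor();
        System.out.println("sqoop import finished with exit code " + exitCode);
    }
}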
Reading via the web application is possible if you use an MPP query engine like Impala; it works pretty well and is production-ready technology (a minimal JDBC read sketch follows the recommendations below). We heavily use complex Impala queries to render content for a Spring Boot application. Since Impala runs everything in memory, there is a chance of slowness or failure on a multi-tenant Cloudera cluster. For smaller user groups (a 1,000-2,000 user base) it works perfectly fine.
Do let me know if you need more input.
My recommendations would be:
Use the Cloudera distribution (read here).
Give enough memory to the Impala daemons.
Make sure your YARN is configured correctly to schedule (fair share or priority share) the ETL load vs. the web application load.
If required, keep the Impala daemons away from YARN.
Define a memory quota for Impala so it allows concurrent queries.
Flatten your queries so Impala runs faster without joins and shuffles.
If you are reading just a few columns, store the data in Parquet; it works very fast.
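For the web application read path, a minimal plain-JDBC sketch over Impala; the JDBC URL, table and column names are placeholders and depend on which Impala JDBC driver you deploy, so treat this as an illustration rather than a drop-in:

// Minimal sketch: the web tier reads the offloaded data through Impala with plain JDBC,
// using a flattened, join-free query over a Parquet-backed table and only the columns needed.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ImpalaReadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: adjust host, port, database and driver for your cluster.
        String url = "jdbc:impala://impala-host:21050/offload_db";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT doc_id, doc_title, created_ts FROM documents_parquet WHERE region = ? LIMIT 100")) {
            ps.setString(1, "EMEA");                                   // placeholder filter value
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("doc_id") + " | " + rs.getString("doc_title"));
                }
            }
        }
    }
}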
I loaded my data using an Apache Ignite cache.
The data is approximately 2 GB in size, in the form of byte[5604*300,000], and I get the cache values from the Ignite client using "Iterator it = cache.localEntries(CachePeekMode.PRIMARY).iterator()".
The problem is that it takes too much time to transform the loaded data into a byte[].
Is there any faster way to retrieve the loaded data from the cache?
Also, localEntries() currently returns an Iterator. Is there any way to get a different return type?
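A minimal sketch of the read path described above, assuming an Integer -> byte[] chunk layout and a cache named "dataCache" (both are assumptions; the post does not show the cache configuration), and making no claim about being the fastest approach:

// Minimal sketch: iterate the entries that are primary on this node and
// reassemble their byte[] values into one buffer.
import javax.cache.Cache;
import java.io.ByteArrayOutputStream;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CachePeekMode;

public class IgniteLocalEntriesSketch {
    public static void main(String[] args) throws Exception {
        try (Ignite ignite = Ignition.start()) {                                       // default node configuration
            IgniteCache<Integer, byte[]> cache = ignite.getOrCreateCache("dataCache"); // assumed cache name
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // localEntries(CachePeekMode.PRIMARY) only visits entries primary on this node.
            for (Cache.Entry<Integer, byte[]> entry : cache.localEntries(CachePeekMode.PRIMARY)) {
                out.write(entry.getValue());                                           // append each chunk
            }
            byte[] all = out.toByteArray();
            System.out.println("Reassembled " + all.length + " bytes from local entries");
        }
    }
}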
I have a huge CSV file of about 2 GB for which I would like to generate a dynamic report. Is there any way to load only the first few MB of data into the report, and then, on scrolling to the next page, load the next few MB, and so on? The data to be visualized is huge, so I need good performance and want to avoid crashes.
Our JRCsvDataSource implementation (assuming this is the one you want to use) does not consume memory while reading CSV data from the file or input stream, as it does not hold on to any values and just parses data row by row. But if the data amount is huge, then the report output itself will be huge, in which case you need to use a report virtualizer during report filling and then during viewing/exporting.
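For example, a minimal sketch of filling a report from a CSV with JRCsvDataSource plus a swap-file virtualizer, so large report output is paged to disk instead of held in memory; the report path, CSV path and virtualizer settings are placeholders:

// Minimal sketch: fill a compiled report from a CSV data source with a swap-file virtualizer.
import java.io.File;
import java.util.HashMap;
import java.util.Map;

import net.sf.jasperreports.engine.JRParameter;
import net.sf.jasperreports.engine.JasperExportManager;
import net.sf.jasperreports.engine.JasperFillManager;
import net.sf.jasperreports.engine.JasperPrint;
import net.sf.jasperreports.engine.data.JRCsvDataSource;
import net.sf.jasperreports.engine.fill.JRSwapFileVirtualizer;
import net.sf.jasperreports.engine.util.JRSwapFile;

public class CsvReportWithVirtualizer {
    public static void main(String[] args) throws Exception {
        // Keep at most 300 report pages in memory; swap the rest to files under /tmp.
        JRSwapFileVirtualizer virtualizer =
                new JRSwapFileVirtualizer(300, new JRSwapFile("/tmp", 2048, 1024), true);

        JRCsvDataSource csv = new JRCsvDataSource(new File("big-data.csv")); // placeholder CSV path
        csv.setUseFirstRowAsHeader(true);                                    // field names taken from the header row

        Map<String, Object> params = new HashMap<>();
        params.put(JRParameter.REPORT_VIRTUALIZER, virtualizer);

        JasperPrint print = JasperFillManager.fillReport("report.jasper", params, csv); // placeholder compiled report
        virtualizer.setReadOnly(true);   // filling is done; pages will no longer change
        JasperExportManager.exportReportToPdfFile(print, "report.pdf");
        virtualizer.cleanup();           // remove the swap files
    }
}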
I am using the tFileInputJson and tMongoDBOutput components to store JSON data into a MongoDB Database.
When trying this with a small amount of data (nearly 100k JSON objects), the data can be stored into the database without any problems.
Now my requirement is to store nearly 300k JSON objects into the database and my JSON objects look like:
{
"LocationId": "253b95ec-c29a-430a-a0c3-614ffb059628",
"Sdid": "00DlBlqHulDp/43W3eyMUg",
"StartTime": "2014-03-18 22:22:56.32",
"EndTime": "2014-03-18 22:22:56.32",
"RegionId": "10d4bb4c-69dc-4522-801a-b588050099e4",
"DeviceCategories": [
"ffffffff-ffff-ffff-ffff-ffffffffffff",
"00000000-0000-0000-0000-000000000000"
],
"CheckedIn": false
}
While I am performing this operation I am getting the following Exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
[statistics] disconnected
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuffer.append(StringBuffer.java:237)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toString(Unknown Source)
at samplebigdata.retail_store2_0_1.Retail_Store2.tFileInputJSON_1Process(Retail_Store2.java:1773)
at samplebigdata.retail_store2_0_1.Retail_Store2.runJobInTOS(Retail_Store2.java:2469)
at samplebigdata.retail_store2_0_1.Retail_Store2.main(Retail_Store2.java:2328)
Job Retail_Store2 ended at 15:14 10/11/2014. [exit code=1]
My current job looks like:
How can I store so much data into the database in a single job?
The issue here is that you're printing the JSON objects to the console (with your tLogRow). This requires all of the JSON objects to be held in memory before finally being dumped all at once to the console once the "flow" is completed.
If you remove the tLogRow components then (in a job as simple as this) Talend should only hold whatever the batch size is for your tMongoDBOutput component in memory and keep pushing batches into MongoDB.
As an example, here's a screenshot of me successfully loading 100,000,000 rows of randomly generated data into a MySQL database:
The data set represents about 2.5 GB on disk as a CSV, but it was comfortably handled in memory with a max heap space of 1 GB, because each insert is 100 rows, so the job only really needs to keep 100 rows of the CSV (plus any associated metadata and any Talend overhead) in memory at any one point.
In reality, it will probably keep significantly more than that in memory and simply garbage collect the rows that have been inserted into the database when the max memory is close to being reached.
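To make the batching idea concrete outside of Talend, here is a minimal plain-JDBC sketch of the same pattern; it is not what Talend generates internally, and the connection URL, table and column names are placeholders:

// Minimal sketch: stream a CSV and insert it in batches of 100 rows,
// so only roughly one batch of rows needs to live in memory at a time.
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedCsvInsertSketch {
    public static void main(String[] args) throws Exception {
        int batchSize = 100;
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/demo", "user", "pass");     // placeholder connection
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO rows_table (col_a, col_b) VALUES (?, ?)");  // placeholder table/columns
             BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
            String line;
            int pending = 0;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", -1);   // naive CSV split, fine for a sketch
                ps.setString(1, fields[0]);
                ps.setString(2, fields[1]);
                ps.addBatch();
                if (++pending == batchSize) {            // flush a full batch and forget those rows
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();                       // flush the final partial batch
            }
        }
    }
}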
If you have an absolute requirement to log the JSON records that are successfully put into the database, then you might try outputting to a file instead and streaming the output.
As long as you aren't getting too many invalid JSON objects in your tFileInputJson, you can probably keep the reject-linked tLogRow, as it will only receive the rejected/invalid JSON objects and so shouldn't run out of memory. Since you are restricted to a small amount of memory due to being on a 32-bit system, you should be wary that if the number of invalid JSON objects grows, you will quickly exceed your memory space.
If you simply want to load a large number of JSON objects into a MongoDB database, then you will probably be best off using the tMongoDBBulkLoad component. This takes a flat file (either .csv, .tsv or .json) and loads it directly into a MongoDB database. The documentation I just linked to shows all the relevant options, but you might be particularly interested in the --jsonArray additional argument that can be passed to the database. There is also a basic example of how to use the component.
This would mean you couldn't do any processing midway through the load, and you would have to use a pre-prepared JSON/CSV file to load the data, but if you just want a quick way to load data into the database using Talend then this should cover it.
If you need to process chunks of the file at a time, then you might want to look at a much more complicated job with a loop where you load n records from your input, process them, and then restart the processing part of the loop, selecting the next n records by skipping a header of n records, then repeating with a header of 2n records, and so on.
Garpmitzn's answer pretty much covers how to change the JVM settings to increase memory space, but for something as simple as this you just want to reduce the amount you're keeping in memory for no good reason.
As an aside, if you're paying for an Enterprise licence of Talend then you should probably be able to get yourself a 64-bit box with 16 GB of RAM easily enough, and that will drastically help with your development. I'd at least hope that your production job execution server has a decent amount of memory.
I feel you are reading everything into Talend's memory. You have to play with the Java JVM parameters like -Xms and -Xmx: you can increase -Xmx to a bigger size than what it's currently set to, e.g. if it's set to -Xmx2048m then increase it to -Xmx4096m, or as needed.
These parameters are available in the .bat/.sh file of the exported job, or in Talend Studio you can find them under the Run Job tab > Advanced settings > JVM Settings.
But it's advisable to design the job in such a way that you don't load too much into memory.