I am using the tFileInputJson and tMongoDBOutput components to store JSON data into a MongoDB Database.
When I try this with a smaller amount of data (nearly 100k JSON objects), the data is stored in the database without any problems.
Now my requirement is to store nearly 300k JSON objects into the database and my JSON objects look like:
{
"LocationId": "253b95ec-c29a-430a-a0c3-614ffb059628",
"Sdid": "00DlBlqHulDp/43W3eyMUg",
"StartTime": "2014-03-18 22:22:56.32",
"EndTime": "2014-03-18 22:22:56.32",
"RegionId": "10d4bb4c-69dc-4522-801a-b588050099e4",
"DeviceCategories": [
"ffffffff-ffff-ffff-ffff-ffffffffffff",
"00000000-0000-0000-0000-000000000000"
],
"CheckedIn": false
}
While I am performing this operation I am getting the following Exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
[statistics] disconnected
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuffer.append(StringBuffer.java:237)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toString(Unknown Source)
at samplebigdata.retail_store2_0_1.Retail_Store2.tFileInputJSON_1Process(Retail_Store2.java:1773)
at samplebigdata.retail_store2_0_1.Retail_Store2.runJobInTOS(Retail_Store2.java:2469)
at samplebigdata.retail_store2_0_1.Retail_Store2.main(Retail_Store2.java:2328)
Job Retail_Store2 ended at 15:14 10/11/2014. [exit code=1]
My current job looks like:
How can I store so much data into the database in a single job?
The issue here is that you're printing the JSON object to the console (with your tLogRow). This requires all of the JSON objects to be held in memory before finally being dumped all at once to the console once the "flow" is completed.
If you remove the tLogRow components then (in a job as simple as this) Talend should only hold one batch for your tMongoDBOutput component in memory at a time and keep pushing batches into MongoDB.
As an example, here's a screenshot of me successfully loading 100000000 rows of randomly generated data into a MySQL database:
The data set is about 2.5 GB on disk as a CSV but was comfortably handled with a max heap space of 1 GB: each insert is 100 rows, so the job only really needs to keep 100 rows of the CSV (plus any associated metadata and any Talend overheads) in memory at any one point.
In reality, it will probably keep significantly more than that in memory and simply garbage collect the rows that have been inserted into the database when the max memory is close to being reached.
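The batching idea can be sketched in plain Java. This is only an illustration of why memory stays bounded, not the code Talend generates; `BatchedLoad`, `loadInBatches` and the `sink` callback (a stand-in for one bulk insert) are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.IntStream;

final class BatchedLoad {
    /**
     * Streams rows from the source and hands them to the sink in fixed-size
     * batches, so at most batchSize rows are buffered at any time.
     * Returns the number of batches flushed.
     */
    static <T> int loadInBatches(Iterator<T> rows, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> buffer = new ArrayList<>(batchSize);
        int batches = 0;
        while (rows.hasNext()) {
            buffer.add(rows.next());
            if (buffer.size() == batchSize) {
                sink.accept(new ArrayList<>(buffer)); // stand-in for one bulk insert
                buffer.clear();                       // inserted rows become garbage
                batches++;
            }
        }
        if (!buffer.isEmpty()) {                      // flush the final partial batch
            sink.accept(new ArrayList<>(buffer));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        Iterator<Integer> rows = IntStream.range(0, 250).boxed().iterator();
        int batches = loadInBatches(rows, 100,
                batch -> System.out.println("inserted " + batch.size() + " rows"));
        System.out.println(batches + " batches");
    }
}
```

A component that has to see every row before it emits anything breaks this pattern, because its buffer grows with the input instead of with the batch size.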
If you have an absolute requirement for logging the JSON records that are being successfully put into the database then you might try outputting into a file instead and stream the output.
As long as you aren't getting too many invalid JSON objects in your tFileInputJson then you can probably keep the reject-linked tLogRow, as it will only receive the rejected/invalid JSON objects and so shouldn't run out of memory. As you are restricted to small amounts of memory due to being on a 32-bit system, be wary that if the number of invalid JSON objects grows you will quickly exceed your memory space.
If you simply want to load a large number of JSON objects into a MongoDB database then you will probably be best off using the tMongoDBBulkLoad component. This takes a flat file (.csv, .tsv or .json) and loads it directly into a MongoDB database. The documentation I just linked to shows all the relevant options, but you might be particularly interested in the --jsonArray additional argument that can be passed to the database. There is also a basic example of how to use the component.
This would mean you couldn't do any processing mid way through the load and you are having to use a preprepared json/csv file to load the data but if you just want a quick way to load data into the database using Talend then this should cover it.
If you needed to process chunks of the file at a time then you might want to look at a much more complicated job with a loop: load the first n records from your input and process them, then restart the processing part of the loop and select the next n records by skipping a header of n records, then repeat with a header of 2n records, and so on.
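The shape of that loop, sketched in plain Java rather than as a Talend job. `ChunkedLoop`, `readChunk` and `processInChunks` are hypothetical names, and the `Supplier` stands in for re-opening the input file on every pass.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ChunkedLoop {
    /** Re-opens the source, skips `header` records and returns up to n records. */
    static List<String> readChunk(Supplier<Stream<String>> source, long header, int n) {
        try (Stream<String> records = source.get()) {
            return records.skip(header).limit(n).collect(Collectors.toList());
        }
    }

    /** Processes the input n records at a time with headers 0, n, 2n, ...; returns the chunk count. */
    static int processInChunks(Supplier<Stream<String>> source, int n,
                               Consumer<List<String>> processor) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        int chunks = 0;
        for (long header = 0; ; header += n) {
            List<String> chunk = readChunk(source, header, n);
            if (chunk.isEmpty()) {
                break;                 // ran past the end of the input
            }
            processor.accept(chunk);
            chunks++;
            if (chunk.size() < n) {
                break;                 // short chunk: this was the last one
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("r1", "r2", "r3", "r4", "r5", "r6", "r7");
        int chunks = processInChunks(input::stream, 3, System.out::println);
        System.out.println(chunks + " chunks");
    }
}
```

Only one chunk is held at a time, but every pass re-reads the input from the start to skip the header, so the total read cost grows with the square of the file size.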
Garpmitzn's answer pretty much covers how to change JVM settings to increase memory space but for something as simple as this you just want to reduce the amount you're keeping in memory for no good reason.
As an aside, if you're paying out for an Enterprise licence of Talend then you should probably be able to get yourself a 64-bit box with 16 GB of RAM easily enough, and that will drastically help with your development. I'd at least hope that your production job execution server has a bunch of memory.
I think you are reading everything into Talend's memory. You have to tune the JVM parameters such as -Xms and -Xmx: increase -Xmx above its current value, for example from -Xmx2048m to -Xmx4096m.
These parameters are available in the .bat/.sh file of the exported job, or in Talend Studio under the Run Job tab > Advanced settings > JVM Settings.
It is advisable, though, to design the job in such a way that you don't load too much into memory.
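To confirm which heap limit a job really runs with, a small check like the following can help. It is a minimal sketch using only the standard `Runtime` API; the class name `HeapInfo` is made up for illustration.

```java
final class HeapInfo {
    /** Maximum heap the JVM will attempt to use, in MiB (reflects -Xmx when it is set). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Heap currently in use = heap claimed from the OS minus the free part of it.
        long usedMiB = (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedMiB + " MiB");
    }
}
```

Running it with, say, `java -Xmx1024m HeapInfo` shows whether the flag took effect; the reported maximum is usually close to, but not exactly, the requested value.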
My organization is thinking about offloading unstructured data such as text and images, currently saved in tables in an Oracle database, into Hadoop. The database is around 10 TB and growing; the CLOB/BLOB columns account for around 3 TB. Right now these columns are queried for certain kinds of reports through a web application. They are also written to, but not very frequently.
What approach can we take to offload the data properly while ensuring that the offloaded data stays available for reads through the existing web application?
You can get part of the answer on the Oracle blog (link).
If the data needs to be pulled into HDFS via Sqoop, then you should first read the following from the Sqoop documentation.
Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accomodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.
Reading via a web application is possible if you use an MPP query engine like Impala; it works pretty well and is a production-ready technology. We make heavy use of complex Impala queries to render content for a Spring Boot application. Since Impala runs everything in memory, there is a chance of slowness or failure on a multi-tenant Cloudera cluster. For smaller user groups (a user base of 1000-2000) it works perfectly fine.
Do let me know if you need more input.
My recommendations:
Use the Cloudera distribution (read here).
Give the Impala daemons enough memory.
Make sure YARN is configured correctly for scheduling (fair share or priority share) the ETL load vs. the web application load.
If required, keep the Impala daemons away from YARN.
Define a memory quota for Impala so that it allows concurrent queries.
Flatten your queries so Impala runs faster, without joins and shuffles.
If you are reading just a few columns, store the data in Parquet; it works very fast.
I am trying to use Spark Streaming to process an order stream, and I have some previously computed features for each buyer_id of the orders in the stream.
I need to get these features while the Spark Streaming job is running.
Currently I store the buyer_id features in a Hive table, load them into an RDD, and use
val buyerfeatures = loadBuyerFeatures()
orderstream.transform(rdd => rdd.leftOuterJoin(buyerfeatures))
to get the pre-computed features.
Another way to deal with this might be to save the features in an HBase table and fire a get for every buyer_id.
Which one is better? Or can I solve this in another way?
From my short experience:
Loading the necessary data for the computation should be done BEFORE starting the streaming context:
If you load inside a DStream operation, the load will be repeated at every batch interval.
If you load each time from Hive, you should seriously consider overhead costs and possible problems during data transfer.
So, if your data is already computed and "small" enough, load it at the beginning of the program into a broadcast variable or, even better, into a final variable. Either that, or create an RDD before the DStream and keep it as a reference (which looks like what you are doing now), although remember to cache it (provided you have enough space).
If you actually do need to read it at streaming time (maybe you receive your query key from the stream), then try to do it once in a foreachPartition and save it in a local variable.
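As a plain-Java analogy (no Spark involved), the "load once, reuse for every batch" pattern looks roughly like this. `LookupJoin`, `loadBuyerFeatures` and the sample data are hypothetical; in a real job the map would be a broadcast variable or a cached RDD as described above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class LookupJoin {
    /** Stand-in for the expensive load (for example from Hive); the data is made up. */
    static Map<String, String> loadBuyerFeatures() {
        Map<String, String> features = new HashMap<>();
        features.put("buyer-1", "frequent");
        features.put("buyer-2", "new");
        return features;
    }

    /** Left outer join of one batch of buyer ids against the preloaded features. */
    static Map<String, Optional<String>> joinBatch(List<String> buyerIds,
                                                   Map<String, String> features) {
        Map<String, Optional<String>> joined = new LinkedHashMap<>();
        for (String id : buyerIds) {
            // Buyers without features are kept, with an empty value.
            joined.put(id, Optional.ofNullable(features.get(id)));
        }
        return joined;
    }

    public static void main(String[] args) {
        // Loaded once, before any batch is processed.
        Map<String, String> features = loadBuyerFeatures();
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("buyer-1", "buyer-3"),
                Arrays.asList("buyer-2"));
        for (List<String> batch : batches) {
            System.out.println(joinBatch(batch, features));
        }
    }
}
```

Calling `loadBuyerFeatures()` inside the loop instead would repeat the load for every batch, which is the overhead warned about above.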
My project is to implement interactive queries so that users can explore the data. For example, we have a list of columns; the user picks some, adds them to a list, and presses "view data". The data is currently stored in Cassandra and we use Spark SQL to query it.
The data flow is: a raw log is processed by Spark and then stored in Cassandra. The data is a time series with more than 20 columns and 4 metrics. In my tests so far, writes to Cassandra are quite slow because more than 20 dimensions are part of the clustering key.
The idea is to load all the data from Cassandra into Spark and cache it in memory, then provide an API to clients and run queries against the Spark cache.
But I don't know how to keep that cached data persistent. I am trying to use spark-job-server, which has a feature called shared objects, but I am not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB of RAM. We estimate that the data to query is about 100 GB.
What I have already tried:
Storing the data in Alluxio and loading it into Spark from there. Loading is slow: for 4 GB of data, Spark first reads from Alluxio, which takes more than 1 minute, and then writes to disk (Spark shuffle), which costs another 2 or 3 minutes. That exceeds our target of under 1 minute. We tested 1 job on 8 CPU cores.
Storing the data in MemSQL, but it is rather costly: 1 day of data costs 2 GB of RAM. I am not sure the speed will hold up as we scale.
Using Cassandra directly, but Cassandra does not support GROUP BY.
So what I really want to know is whether my direction is right. What can I change to achieve the goal: MySQL-like queries (with a lot of GROUP BY, SUM, ORDER BY) whose results are returned to the client through an API?
If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.
So, as you are using Spark JobServer, you can create a long running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because it will be cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context, otherwise Spark will save a large portion of the data on disk, and this would have some impact on performance.
Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.
I am trying to scan some rows using prefix filter from the HBase table. I am on HBase 0.96.
I want to increase the throughput of each RPC call so as to reduce the number of requests hitting the region.
I tried setCaching(int) and setCacheBlocks(true) on the scan object. I also tried adding resultScanner.next(int). Using all these combinations I am still not able to reduce the number of RPC calls. I am still hitting the HBase region for each key instead of bringing back multiple keys per RPC call.
The HBase region server/DataNode has enough CPU and memory allocated. My data is also evenly distributed across the region servers, and the data that I bring back per key is not a lot.
I observed that when I add more data to the table, the time taken for a request increases. It also increases when the number of requests increases.
Thank you for your help.
A prefix filter is usually a performance killer because it performs a full table scan. Always use a start and stop row in your scans rather than a prefix filter.
Scan scan = new Scan(Bytes.toBytes("prefix"),Bytes.toBytes("prefix~"));
When you iterate over the Results from the ResultScanner, every iteration can be an RPC call if scanner caching is 1. Scan.setCaching(n) sets how many rows are fetched per RPC, and resultScanner.next(n) returns a batch of results in one call.
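The "prefix~" stop row only works while every byte that follows the prefix sorts below '~' (0x7E). A more general stop row is the prefix with its last byte incremented. Here is a minimal sketch in plain Java with no HBase dependency; `PrefixRange` and `stopRowForPrefix` are hypothetical names, and it assumes that an empty stop row means "scan to the end of the table".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixRange {
    /**
     * Returns the smallest row key that sorts after every key starting with
     * the given prefix. An empty array means "no upper bound".
     */
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented, so drop them first.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++; // increment the last remaining byte
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = "prefix".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(start);
        System.out.println("start row: " + Arrays.toString(start));
        System.out.println("stop row:  " + Arrays.toString(stop));
    }
}
```

The two arrays would then be passed as the start and stop rows of the Scan in place of "prefix" and "prefix~".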
Imagine you have a big file stored in HDFS which contains structured data. The goal is to process only a portion of the data in the file, such as all the lines where the value in the second column falls within a given range. Is it possible to launch the MR job such that HDFS only streams the relevant portion of the file, rather than streaming everything to the mappers?
The reason is that I want to speed up the job by only working on the portion that I need. One approach is probably to run an MR job to create a new file, but I am wondering if one can avoid that.
Please note that the goal is to keep the data in HDFS; I do not want to read from and write to a database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading about 3.6% of the total file (1/28), or roughly 36GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
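The per-row test a mapper would apply can be sketched as a plain Java predicate. This is only an illustration: it assumes tab-separated lines with a numeric second column, which the question does not specify, and `SecondColumnFilter` is a made-up name.

```java
final class SecondColumnFilter {
    /** True when the second tab-separated column parses as a long within [lo, hi]. */
    static boolean inRange(String line, long lo, long hi) {
        int first = line.indexOf('\t');
        if (first < 0) {
            return false;                       // no second column
        }
        int second = line.indexOf('\t', first + 1);
        // Extract only the second column instead of splitting the whole line.
        String column = second < 0
                ? line.substring(first + 1)
                : line.substring(first + 1, second);
        try {
            long value = Long.parseLong(column.trim());
            return value >= lo && value <= hi;
        } catch (NumberFormatException e) {
            return false;                       // malformed rows are skipped
        }
    }

    public static void main(String[] args) {
        String[] lines = {"a\t15\tx", "b\t25\ty", "c\tnot-a-number\tz"};
        for (String line : lines) {
            if (inRange(line, 10, 20)) {
                System.out.println("keep: " + line);
            }
        }
    }
}
```

Inside a real mapper, the map method would return early for rows where this check fails and emit output only for the rest.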
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is to look at the way that Hive solves this problem. The data is in "tables" which are really just metadata about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition, so if you were partitioning a file by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside each date directory would be your actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
The bottom line is that if you want functionality like this, you either want to move to a higher-level language (Hive/Pig) or you need to roll your own solution.
A big part of the processing cost is parsing the data to produce key-values for the Mapper. There we (usually) create one Java object per value plus some container. It is costly in terms of both CPU and garbage collector pressure.
I would suggest a solution "in the middle". You can write an input format which reads the input stream and skips non-relevant data at an early stage (for example by looking at the first few bytes of the string). As a result you will read all the data, but actually parse and pass to the Mapper only a portion of it.
Another approach I would consider is to use the RCFile format (or another columnar format), and take care that relevant and non-relevant data sit in different columns.
If the files that you want to process have some unique attribute in their filename (like an extension or a partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and "_xxx" files/dirs, but you can extend this with setInputPathFilter.
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this, which breaks the "one block per mapper" paradigm, but sometimes this is acceptable. It can sometimes take more time to "do it right", especially if you're dealing with a small amount of data and the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.