Inserting batches of data into CockroachDB results in a maximum allowed size error

I was inserting batches of data into CockroachDB, with multiple batches (up to 10) in a single transaction. After a couple of batches the insert failed with “message size 50 MiB bigger than maximum allowed message size 16 MiB”, which makes sense: that batch contained a record with an outsized string.
I added a statement to the transaction to update the max_read_buffer_size cluster setting to 100 MiB, but I'm still getting the error.

max_read_buffer_size is a cluster setting. Cluster settings cannot be set inside a transaction. Make sure you update the setting outside of the transaction.
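For example, here is a minimal sketch of doing that over JDBC from Scala; the connection URL is a placeholder and the setting name is taken as given in the question:

import java.sql.DriverManager

// Apply the cluster setting as its own auto-committed statement,
// before the batch-insert transaction is opened.
// The JDBC URL and the setting name (as given in the question) are assumptions.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost:26257/defaultdb?user=root")
val setStmt = conn.createStatement()
setStmt.execute("SET CLUSTER SETTING max_read_buffer_size = '100 MiB'")  // outside any transaction
setStmt.close()

conn.setAutoCommit(false)   // now start the transaction for the batched inserts
// ... run the batched INSERT statements here ...
conn.commit()
conn.close()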

Related

NiFi - FlowFiles piling up before MergeRecord

I've got an issue with flowfiles passing through a mergerecord.
Here is the flow: [image: Flow Queue]
I've tried most of the permutations of the configuration settings but can't seem to get flowfiles out of the queue no matter what I do:
[image: MergeRecord Configuration]
Does anyone know what could be blocking this MergeRecord from passing FlowFiles? The FlowFiles are currently "text" files; would they need to be JSON for MergeRecord to group them correctly?
The merge is correlating on TableName, meaning it will only merge FlowFiles whose TableName attribute has the same value.
However, you only have 10 total bins, meaning that if 10 FlowFiles come in with table1,2,3,4,5,6,7,8,9,10 you have maxed out your bins, so any FlowFiles with table11,12,13,14, etc. aren't going to get merged until a bin frees up. They will just sit in the queue and wait.
Further, your merge config is only set with Min 1 and Max 1000, meaning you need 1000 records with TableName = table1 before those FlowFiles are merged and the bin is released.
With 5000 FlowFiles making up 3 MB, I'm going to assume there aren't many records per FlowFile, so you aren't reaching 1000 records and releasing any bins.
So, double check that your TableName attribute is being set as you expect, and consider modifying the settings that control the merge. You could lower Max Records from 1000 so bins trigger sooner, add a Max Size, or add a Max Age to time-bound the merge.

exception: org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update

I have a simple Spark job that streams data to a Delta table.
The table is pretty small and is not partitioned.
A lot of small parquet files are created.
As recommended in the documentation (https://docs.delta.io/1.0.0/best-practices.html) I added a compaction job that runs once a day.
val path = "..."
val numFiles = 16

spark.read
  .format("delta")
  .load(path)
  .repartition(numFiles)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .save(path)
Every time the compaction job runs the streaming job gets the following exception:
org.apache.spark.sql.delta.ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.
I tried to add the following config parameters to the streaming job:
spark.databricks.delta.retryWriteConflict.enabled = true # would be false by default
spark.databricks.delta.retryWriteConflict.limit = 3 # optionally limit the maximum amount of retries
It doesn't help.
Any idea how to solve the problem?
When you're streaming the data in, small files are being created (additive) and these files are being referenced in your delta log (an update). When you perform your compaction, you're trying to resolve the small-files overhead by collating the data into larger files (currently 16). These large files are created alongside the small ones, but the change takes effect when the delta log is written. That is, transactions 0-100 make 100 small files, compaction occurs, and your new transaction says to refer to the 16 large files instead. The problem is that transactions 101-110 have already occurred from the streaming job while the compaction was running. After all, you're compacting ALL of your data, so you essentially have a merge conflict.
The solution is to go to the next step in the best practices and only compact selected partitions using:
.option("replaceWhere", partition)
When you compact every day, the partition variable should represent the partition of your data for yesterday. No new files are being written to that partition, and the delta log can identify that the concurrent changes will not apply to currently incoming data for today.
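For illustration, here is a sketch of the partition-scoped compaction, assuming the table is partitioned by a date column; the column name and the yesterday value are illustrative, not taken from the original job:

// Partition-scoped compaction sketch. The `date` column and the
// `yesterday` value are assumptions; substitute your real partition column
// and compute yesterday's value in the job.
val path = "..."
val numFiles = 16
val yesterday = "2021-05-03"                      // hypothetical partition value

spark.read
  .format("delta")
  .load(path)
  .where(s"date = '$yesterday'")                  // read only the partition being compacted
  .repartition(numFiles)
  .write
  .option("dataChange", "false")                  // the rewrite does not change the data
  .option("replaceWhere", s"date = '$yesterday'") // restrict the overwrite to that partition
  .format("delta")
  .mode("overwrite")
  .save(path)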

Fetching data on the order of 600 million rows from a Greenplum table in Apache NiFi gives GC overhead limit exceeded

I am trying to fetch data from a Greenplum table using Apache NiFi's QueryDatabaseTableRecord processor. I am seeing a GC overhead limit exceeded error and the NiFi web UI becomes unresponsive.
I have set the 'Fetch Size' property to 10000 but it seems the property is not being used in this case.
Other settings:
Database Type : Generic
Max Rows Per Flow File : 1000000
Output Batch Size : 2
jvm min/max memory allocation is 4g/8g
Is there an alternative to avoid the GC errors for this task?
This is a clear case of the "Fetch Size" parameter not being used; see the processor documentation on this.
Try testing the JDBC setFetchSize on its own to see if it works.
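For instance, a standalone sketch (Scala, outside NiFi) to check whether the driver streams results when a fetch size is set; the URL, credentials, and table name are placeholders:

import java.sql.DriverManager

// Standalone check of JDBC fetch-size behaviour against Greenplum.
// URL, credentials, and table name are placeholders.
val conn = DriverManager.getConnection(
  "jdbc:postgresql://greenplum-host:5432/mydb", "user", "password")
conn.setAutoCommit(false)          // the PostgreSQL-family driver only uses a cursor (and honours fetch size) with autocommit off
val stmt = conn.createStatement()
stmt.setFetchSize(10000)           // stream ~10k rows per round trip instead of buffering the whole result
val rs = stmt.executeQuery("SELECT * FROM my_big_table")
var rows = 0L
while (rs.next()) { rows += 1 }    // memory stays bounded if the fetch size is respected
println(s"Read $rows rows")
rs.close(); stmt.close(); conn.close()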

HBase HFiles size generation

I am working on an HBase cluster with 28 region servers.
I have a table, which uses a wide-table definition. The row key is a Hex string, while each row has exactly one column family, which in turn has 80 qualifiers.
Each qualifier name is an int (starting from 1 to 80) and each value is a long.
The table has been pre-split into 28 regions, using the classic getHexSplits method described in the HBase manual.
I have a Map-Reduce job which creates the table, and has to load about 1.8 TB of data in it.
I am using HFileOutputStream to create the HFiles. The problem is that, despite the job being configured with 28 reducers and hbase.hregion.max.filesize being set to the default (10 GB), I get far more HFiles than I expect (1,149 of approx. 1.61 GB each!).
Once the table is created and the HFiles are loaded, the table immediately starts both MAJOR and MINOR compactions, which triggers lots of I/O and affects my next Map-Reduce job, which reads from the table. I suppose this happens because there are multiple HFiles per region, and HBase tries to compact them to optimize reads?
How can I make sure I get a smaller number of HFiles, in order to avoid the compactions? What would be the ideal number of regions for the table, and what other parameters can I set to make sure I get no compactions?
My table is written only once, and then used just for reads.

How to solve Heap Space Error in Talend Enterprise Big Data

I am using the tFileInputJson and tMongoDBOutput components to store JSON data into a MongoDB Database.
When trying this with a small amount of data (nearly 100k JSON objects), the data can be stored into the database without any problems.
Now my requirement is to store nearly 300k JSON objects into the database and my JSON objects look like:
{
  "LocationId": "253b95ec-c29a-430a-a0c3-614ffb059628",
  "Sdid": "00DlBlqHulDp/43W3eyMUg",
  "StartTime": "2014-03-18 22:22:56.32",
  "EndTime": "2014-03-18 22:22:56.32",
  "RegionId": "10d4bb4c-69dc-4522-801a-b588050099e4",
  "DeviceCategories": [
    "ffffffff-ffff-ffff-ffff-ffffffffffff",
    "00000000-0000-0000-0000-000000000000"
  ],
  "CheckedIn": false
}
While I am performing this operation I am getting the following Exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
[statistics] disconnected
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuffer.append(StringBuffer.java:237)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toString(Unknown Source)
at samplebigdata.retail_store2_0_1.Retail_Store2.tFileInputJSON_1Process(Retail_Store2.java:1773)
at samplebigdata.retail_store2_0_1.Retail_Store2.runJobInTOS(Retail_Store2.java:2469)
at samplebigdata.retail_store2_0_1.Retail_Store2.main(Retail_Store2.java:2328)
Job Retail_Store2 ended at 15:14 10/11/2014. [exit code=1]
My current job looks like: [image of the Talend job]
How can I store so much data into the database in a single job?
The issue here is that you're printing the JSON objects to the console (with your tLogRow). This requires all of the JSON objects to be held in memory before finally being dumped all at once to the console once the "flow" is completed.
If you remove the tLogRow components then (in a job as simple as this) Talend should only hold whatever the batch size is for your tMongoDBOutput component in memory and keep pushing batches into MongoDB.
As an example, here's a screenshot of me successfully loading 100,000,000 rows of randomly generated data into a MySQL database: [screenshot]
The data set represents about 2.5 GB on disk as a CSV but was comfortably handled in memory with a max heap space of 1 GB, as each insert is 100 rows, so the job only really needs to keep 100 rows of the CSV (plus any associated metadata and any Talend overheads) in memory at any one point.
In reality, it will probably keep significantly more than that in memory and simply garbage collect the rows that have been inserted into the database when the max memory is close to being reached.
If you have an absolute requirement for logging the JSON records that are being successfully put into the database then you might try outputting to a file instead and streaming the output.
As long as you aren't getting too many invalid JSON objects in your tFileInputJson, you can probably keep the reject-linked tLogRow, as it will only receive the rejected/invalid JSON objects and so shouldn't run out of memory. As you are restricted to small amounts of memory due to being on a 32-bit system, you might need to be wary that if the number of invalid JSON objects grows you will quickly exceed your memory space.
If you simply want to load a large amount of JSON objects into a MongoDB database then you will probably be best off using the tMongoDBBulkLoad component. This takes a flat file (either .csv, .tsv, or .json) and loads it directly into a MongoDB database. The documentation I just linked to shows all the relevant options, but you might be particularly interested in the --jsonArray additional argument that can be passed to the database. There is also a basic example of how to use the component.
This would mean you couldn't do any processing midway through the load and you would have to use a pre-prepared JSON/CSV file to load the data, but if you just want a quick way to load data into the database using Talend then this should cover it.
If you need to process chunks of the file at a time, then you might want to look at a much more complicated job with a loop: load n records from your input, process them, then restart the processing part of the loop, selecting the next n records by skipping a header of n records, then repeating with a header of 2n records, and so on...
Garpmitzn's answer pretty much covers how to change JVM settings to increase memory space but for something as simple as this you just want to reduce the amount you're keeping in memory for no good reason.
As an aside, if you're paying out for an Enterprise licence of Talend then you should probably be able to get yourself a 64-bit box with 16 GB of RAM easily enough, and that will drastically help with your development. I'd at least hope that your production job execution server has a decent amount of memory.
I feel you are reading everything into Talend's memory. You have to play with the Java JVM parameters Xms and Xmx: you can increase Xmx to a bigger size than what it is currently set to, e.g. if it is set to Xmx2048 then increase it to Xmx4096, or more.
These parameters are available in the .bat/.sh file of the exported job, or in Talend Studio you can find them under the Run Job tab > Advanced settings > JVM Settings.
But it is advisable to design the job in such a way that you don't load too much into memory.
