FileHelpers - string length issue? - etl

I'm moving some data around using Rhino.ETL.
One of the tables I'm moving has a column that stores a fairly large chunk of text for each row - though it's not that huge, and there are only about 2000 rows.
When running the job I get:
A first chance exception of type 'FileHelpers.FileHelpersException' occurred in FileHelpers.dll
Now, removing the large text column fixes the issue - but the line above is the only output I get.
Is there a restriction somewhere that dictates a limit on data size or something?
Debug output: monobin

I'm a developer of FileHelpers. Can you post the stack trace and the full message of the exception here?
FileHelpers has no limit on string length, but maybe the column stores the data as binary and Rhino.Etl sends it that way.
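One quick check on the binary-data theory - just a sketch, assuming the source is SQL Server, with made-up table and column names - is to look at the column's declared type:
-- Hypothetical names; replace with your actual table and column.
SELECT DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'MyTable' AND COLUMN_NAME = 'LargeTextColumn';
If it comes back as image or varbinary rather than a text type, the problem is the mapping rather than the string length.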
Best Regards

Related

Spring Batch Metadata Issue

When I try to disable Spring Batch metadata creation with the option spring.batch.initialize-schema=never and then launch the batch, nothing happens and the batch terminates immediately without running the related jobs.
On the other hand, when I enable metadata creation the batch runs, but I get the classic SERIALIZED_CONTEXT field size error. I can't always save 4GB of data in the table when I execute the batch.
How can I disable metadata creation for good and still have my batch work?
Edit: I think I found a kind of solution to avoid this issue, and I would like your point of view. In the end I am keeping metadata generation enabled. The issue occurs when you have a large set of data stored in the ExecutionContext you pass between tasklets (we all know this is the reason). In my case it is an ArrayList of elements (POJOs) retrieved from a CSV file with OpenCSV. To overcome this issue I have:
reduced the number of columns and lines in the ArrayList (Spring Batch serializes this ArrayList into the SERIALIZED_CONTEXT field; the more columns and lines you have, the more likely you are to hit this issue)
changed the type of the SERIALIZED_CONTEXT column from TEXT to LONGTEXT (see the DDL sketched after this list)
deleted the toString() method defined in the POJO (not sure it really helps)
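The LONGTEXT change itself is a one-line DDL statement against the Spring Batch metadata tables; a minimal sketch, assuming MySQL and the default table names (SERIALIZED_CONTEXT lives in both context tables):
-- Widen the serialized context columns (MySQL; adjust for your schema).
ALTER TABLE BATCH_STEP_EXECUTION_CONTEXT MODIFY SERIALIZED_CONTEXT LONGTEXT;
ALTER TABLE BATCH_JOB_EXECUTION_CONTEXT MODIFY SERIALIZED_CONTEXT LONGTEXT;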
But I am still wondering, what if you have no choice and you have to load all your columns, what is the best way to prevent this issue?
So this is not an issue with metadata generation but with passing a large amount of data between two steps.
what if you have no choice and you have to load all your columns, what is the best way to prevent this issue?
You can still load all columns but you have to reduce the chunk size. The whole point of chunk processing in Spring Batch is to not load all data in memory. What you can do in your case is to carefully choose a chunk size that fits your requirement. There is no recipe for choosing the correct chunk size (since it depends on the number of columns, the size of each column, etc), so you need to proceed in an empirical way.

Doing String length on SQL Loader input field

I'm reading data from a fixed-length text file and loading it into a table with fixed-length processing.
I want to check the input line length so that I can discard records that don't match the fixed length and log them into an error table.
Example
Load into the Input_Log table if the line meets the specified length.
Load into the Input_Error_Log table if the input line length is less than or greater than the fixed line length.
I believe you would be better served by bulk loading your data into a staging table, then load into the production table from there via a stored procedure where you can apply rules via normal PL/SQL & DML to your heart's content. This is a typical best practice anyway.
sqlldr isn't really the tool to get too complicated in, even if you could do what you want. Maintainability and restart-ability become more complicated when you add complexity to a tool that's really designed for bulk loading. Add the complexity to a proper program.
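A rough sketch of that second pass, assuming the raw lines are bulk-loaded into a one-column staging table (here stg_input with a column rec_line - both names invented), the fixed record length is 100, and both target tables accept the raw line:
-- Lines that match the fixed length go to the good table.
INSERT INTO Input_Log
SELECT rec_line FROM stg_input WHERE LENGTH(rec_line) = 100;
-- Everything shorter, longer, or empty goes to the error table.
INSERT INTO Input_Error_Log
SELECT rec_line FROM stg_input
WHERE LENGTH(rec_line) <> 100 OR rec_line IS NULL;
COMMIT;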
Let us know what you come up with.

Multiple small inserts in clickhouse

I have an event table (MergeTree) in ClickHouse and want to run a lot of small inserts at the same time. However, the server becomes overloaded and unresponsive. Moreover, some of the inserts are lost. There are a lot of records like this in the ClickHouse error log:
01:43:01.668 [ 16 ] <Error> events (Merger): Part 20161109_20161109_240760_266738_51 intersects previous part
Is there a way to optimize such queries? I know I can use bulk inserts for some types of events - basically, running one insert with many records, which ClickHouse handles pretty well. However, some of the events, such as clicks or opens, cannot be batched this way.
The other question: why does ClickHouse decide that similar records already exist when they don't? There are similar records at the time of insert, which have the same fields as in the index, but other fields are different.
From time to time I also receive the following error:
Caused by: ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, message: Connect to localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out, host: localhost, port: 8123; Connect to ip6-localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out
... 36 more
Mostly during the project build, when tests against the ClickHouse database are run.
ClickHouse has a special type of table for this - Buffer. It is stored in memory and allows many small inserts without a problem. We have nearly 200 different inserts per second - it works fine.
Buffer table:
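-- Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes):
-- data is flushed to logs.log_main once all the min thresholds or any max threshold is met.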
CREATE TABLE logs.log_buffer (rid String, created DateTime, some String, d Date MATERIALIZED toDate(created))
ENGINE = Buffer('logs', 'log_main', 16, 5, 30, 1000, 10000, 1000000, 10000000);
Main table:
CREATE TABLE logs.log_main (rid String, created DateTime, some String, d Date)
ENGINE = MergeTree(d, sipHash128(rid), (created, sipHash128(rid)), 8192);
Details in manual: https://clickhouse.yandex/docs/en/operations/table_engines/buffer/
This is a known issue when processing a large number of small inserts into a (non-replicated) MergeTree.
This is a bug; we need to investigate and fix it.
As a workaround, you should send inserts in larger batches, as recommended: about one batch per second: https://clickhouse.tech/docs/en/introduction/performance/#performance-when-inserting-data.
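For illustration, "larger batches" just means one INSERT carrying many rows (the column names below are invented), sent roughly once per second instead of one statement per event; each such statement typically creates a single part per partition touched:
INSERT INTO events (created, event_type, user_id) VALUES
('2016-11-09 01:43:01', 'click', 101),
('2016-11-09 01:43:01', 'open', 102),
('2016-11-09 01:43:02', 'click', 103);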
I've had a similar problem, although not as bad - making ~20 inserts per second caused the server to reach a high load average, memory consumption and CPU usage. I created a Buffer table which buffers the inserts in memory, and then they are flushed periodically to the "real" on-disk table. And just like magic, everything went quiet: load average, memory and CPU usage came down to normal levels. The nice thing is that you can run queries against the Buffer table and get back matching rows from both memory and disk - so clients are unaffected by the buffering. See https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/
Alternatively, you can use something like https://github.com/nikepan/clickhouse-bulk: it will buffer multiple inserts and flush them all together according to user policy.
The design of ClickHouse merge engines is not meant to take small writes concurrently. As far as I understand, the MergeTree merges the parts of data written to a table based on partitions and then re-organizes the parts for better aggregated reads. If we do small writes often, you will eventually encounter another exception about merges:
Error: 500: Code: 252, e.displayText() = DB::Exception: Too many parts (300). Merges are processing significantly slow
When you try to understand why the above exception is thrown, the idea becomes a lot clearer: CH needs to merge data, and there is an upper limit on how many parts can exist! Every write in a batch is added as a new part and then eventually merged with the partitioned table.
SELECT table, count() AS cnt
FROM system.parts
WHERE database = 'dbname'
GROUP BY table
ORDER BY cnt DESC
The above query can help you monitor parts; observe how the parts increase while you are writing and then eventually merge down.
My best bet for the above would be buffering the data set and periodically flushing it to DB, but then that means no real-time analytics.
Using a Buffer table is good; however, please consider these points:
If the server is restarted abnormally, the data in the buffer is lost.
FINAL and SAMPLE do not work correctly for Buffer tables. These conditions are passed to the destination table, but are not used for processing data in the buffer.
When adding data to a Buffer table, one of the buffers is locked (so no reads from it in the meantime).
If the destination table is replicated, some expected characteristics of replicated tables are lost when writing to a Buffer table (no deduplication).
Please read it thoroughly; it's a special-case engine: https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/

BATCH_STEP_EXECUTION (in Oracle) EXIT_MESSAGE (2500 bytes) too small

ETA: Clarifying context: By default, BATCH_STEP_EXECUTION.EXIT_MESSAGE (populated with error code of failed step) is defined as VARCHAR2(2500).
When a step fails, the error message is typically a stack trace on the order of 10k-15k characters. The first 2500 characters rarely give insight into the problem. Two questions:
1) Can I safely change the column type from VARCHAR2(2500) to VARCHAR2(4000)? Or better still, CLOB?
2) Do I need to make any changes in Spring Batch to say, "It's okay to send exit_message of 4000, or unlimited with CLOB, rather than cutting it off at 2500 characters"?
@KevinKirkpatrick,
I believe you can increase the EXIT_MESSAGE length to varchar2(4000) or change the datatype to a CLOB. Both of these options seem to work based on my testing.
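For the VARCHAR2 route, the widening is a simple in-place ALTER (a sketch against the default schema follows). Going to CLOB is not a direct MODIFY in Oracle - it normally means adding a new CLOB column, copying the data over, and renaming.
-- Widen EXIT_MESSAGE in place; the same column also exists in BATCH_JOB_EXECUTION.
ALTER TABLE BATCH_STEP_EXECUTION MODIFY (EXIT_MESSAGE VARCHAR2(4000));
ALTER TABLE BATCH_JOB_EXECUTION MODIFY (EXIT_MESSAGE VARCHAR2(4000));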
The jobRepository also needs to be updated to set the max-varchar-length attribute.
For Example:
<batch:job-repository id="j1" max-varchar-length="100000"/>
I don't know if this is a good solution though. Changing the datatype on a Spring Batch table seems less than ideal.

Out of Memory Exception in mvc 5?

I need to display 40,000 records, but I get a System.OutOfMemoryException in MVC 5. Sometimes 70,000 records load correctly and sometimes not even 40,000 records load. I need to display all the records and export them to MS Excel.
I used the Kendo grid to display the records.
I saw somewhere that the Kendo grid can't handle a huge number of records.
From the Telerik forum:
When OpenAccess executes a query the actual retrieval of results is split into chunks. There is a fetch size that determines the number of records that are read from the database in a single pass. With a query that returns a lot of records this means that the fetch size is not exceeded and not all 40 000 records will be retrieved at one time in memory. Iterating over the result data you will get several reads from the database until the iteration is over. However, when you iterate over the result set subsequent reads are accumulated when you keep references to the objects that are iterated.
An out of memory exception may be caused when you operate with all the records from the grid. The way to avoid such an error would be to work with the data in chunks. For example, a paging for the grid and an option that exports data sequentially from all pages will achieve this. The goal is to try to reduce the objects kept in-memory at a time and let the garbage collection free unneeded memory. A LINQ query with Skip() and Take() is ideal in such cases where having all the data in-memory is costly.
and from http://docs.telerik.com/devtools/aspnet-ajax/controls/grid/functionality/exporting/overview
We strongly recommend not to export large amounts of data since there is a chance to encounter an exception(Timeout or OutOfMemory) if more than one user tries to export the same data simultaneously. RadGrid is not suitable for such scenarios and therefore we suggest that you limit the number of columns and rows. Also it is important to note that the hierarchy and the nested controls have a considerable effect on the performance in this scenario.
What the above is basically saying is to reduce your result set via paging and/or by reducing the number of columns fetched from the DB, so you show only what is actually needed.
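As a rough illustration of the paging idea (SQL Server 2012+ syntax; the table and column names are invented, and a LINQ Skip()/Take() pair translates to roughly this shape):
-- Fetch one page of @PageSize rows instead of all 40,000, and only the columns the grid shows.
SELECT Id, Name, CreatedOn
FROM dbo.Records
ORDER BY Id
OFFSET @PageNumber * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;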
Not really sure what else you could do. You have too much data, and you're running out of memory. Gotta reduce the data to reduce the memory used.
Please go for paging, and try to export all 40,000 records without loading them onto the page. Loading all the data at once takes time and leads to the out-of-memory exception.
