Checksum doesn't match: corrupted data: while reading column `cid` at /opt/clickhouse//data/click - clickhouse

I am using ClickHouse to store data, and I'm getting the following error while querying the column cid from the click table.
Checksum doesn't match: corrupted data.
I don't have any replica for now. Any suggestions for recovery?

The error comes down to the fact that the stored CityHash128 checksum and the checksum of the compressed data don't match, and this exception is thrown in the readCompressedData function.
You can try to disable this check using the disable_checksum option (the disableChecksumming method).
It could work, but corruption most probably means that something is wrong with your raw data, and there is little chance of recovery unless you have backups.

Usually, you will get the data part name and the column name in the exception message.
You could locate that specific data part, remove the files related to that single column, and restart the server. You will lose the (already corrupted) data for one column in one data part (it will be filled with default values on read), but all other data will remain.
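To find where the affected part lives on disk, a query against the system.parts table can help. This is a minimal sketch, assuming the table is named click in the default database; adjust the names to match the part and column reported in your exception:

-- List the active data parts of the table; the path column shows the
-- directory that contains the per-column files (e.g. cid.bin, cid.mrk).
SELECT name, path, rows
FROM system.parts
WHERE database = 'default'
  AND table = 'click'
  AND active;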

Related

How to use putSQL in apache nifi

I am a beginner in data warehousing and Apache NiFi. I was trying to pull MySQL table data into NiFi and then put that data into another MySQL database table. I am successfully getting data from the first database table, and I am also able to write that data to a file using the PutFile processor.
But now I want to store that queued data in a MySQL database table. I know there is a PutSQL processor, but it was not working for me.
Can anyone let me know how to do it correctly?
Here are the screenshots of my flow and the PutSQL configuration.
I converted the data from Avro to JSON and then from JSON to SQL in case that would work, but this did not work either.
Use PutDatabaseRecord and remove the Convert* processors.
From the NiFi docs:
The PutDatabaseRecord processor uses a specified RecordReader to input (possibly multiple) records from an incoming flow file. These records are translated to SQL statements and executed as a single transaction. If any errors occur, the flow file is routed to failure or retry, and if the records are transmitted successfully, the incoming flow file is routed to success. The type of statement executed by the processor is specified via the Statement Type property, which accepts some hard-coded values such as INSERT, UPDATE, and DELETE, as well as 'Use statement.type Attribute', which causes the processor to get the statement type from a flow file attribute. IMPORTANT: If the Statement Type is UPDATE, then the incoming records must not alter the value(s) of the primary keys (or user-specified Update Keys). If such records are encountered, the UPDATE statement issued to the database may do nothing (if no existing records with the new primary key values are found), or could inadvertently corrupt the existing data (by changing records for which the new values of the primary keys exist).
This should be more performant and cleaner.
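To make the statement types concrete, here is a rough sketch of the kind of SQL that PutDatabaseRecord ends up issuing per record; the users table, its columns, and the sample values are hypothetical, purely for illustration:

-- Hypothetical target table for the records produced by the RecordReader.
CREATE TABLE users (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);

-- With Statement Type = INSERT, a record such as {"id": 1, "name": "alice"}
-- is translated into an INSERT and executed as part of a single transaction:
INSERT INTO users (id, name) VALUES (1, 'alice');

-- With Statement Type = UPDATE, the primary key (or the user-specified
-- Update Keys) ends up in the WHERE clause, so records must not change it:
UPDATE users SET name = 'alice' WHERE id = 1;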

CSV Blob Sink - Skip Writing File when 0 Rows Present

This is a relatively simple problem with (I'm hoping) a similarly-simple solution.
In my ADF ETLs, any time there's a known and expected yet unrecoverable row-based error, I don't want my full ETL to fail. Instead, I'd rather pipe those rows off to a log, which I can then pick up at the end of the ETL for manual inspection. To do this, I use conditional splits.
Most of the time, there shouldn't be any rows like this. When this is the case, I don't want my blob sink to write a file. However, the current behavior writes a file no matter what -- it's just that the file only contains the table header.
Is there a way to skip writing anything to a blob sink when there are no input rows?
Edit: Somehow I forgot to specify -- I'm specifically referring to a Mapping Data Flow with a blob sink.
You can use a Lookup activity (with First row only unchecked) to get all your table data first. Then use an If Condition activity to check the count in the Lookup activity's output, for example with an expression like @greater(activity('Lookup1').output.count, 0), where Lookup1 is whatever your Lookup activity is named. If the count is greater than 0, execute the next activity (or the Data Flow).

ORA-29285: file write error

I'm trying to extract data from an Oracle table. I'm using UTL_FILE for that, and I'm receiving the error ORA-29285: file write error. The weird thing here is that if I try to extract the data directly from the table, I get the error; if I extract the data using a simple view, the error is returned as well; BUT if I extract the data using a view with an ORDER BY, the extraction succeeds. I can't understand where the error comes from. I already looked at the length of the lines and found nothing. Any suggestions as to what it could be?
I extract a lot of other data through UTL_FILE successfully. This particular data was first uploaded into the Oracle table directly from a CSV file with ANSI encoding. However, I have other data uploaded the same way that I can export correctly. I also checked the encoding in order to rule out possible mistakes and found nothing.
Many thanks,
Priscila Ferreira

apache drill memory exception

I am trying to reformat over 600 GB of CSV files into Parquet using Apache Drill in a single-node setup.
I run my sql statement:
CREATE TABLE Data_Transform.`/` AS
....
FROM Data_source.`/data_dump/*`
and it is creating parquet files but I get the error:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR:
One or more nodes ran out of memory while executing the query.
Is there a way around this?
Or is there an alternative way to do the conversion?
I don't know if querying all those GB on a local node is feasible. If you've configured the memory per the docs, using a cluster of Drillbits to share the load is the obvious solution, but I guess you already know that.
If you're willing to experiment, and you're converting the CSV files with a select * rather than selecting individual columns, change the query to something like select columns[0] as user_id, columns[1] as user_name. Cast columns to types like int, float, or datetime where possible. This avoids the overhead of storing the data as varchars and prepares the data for future queries that would otherwise need casts for analysis.
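Following that suggestion, here is a hedged sketch of what the column-by-column CTAS could look like; the workspace names come from the question above, while the output path, column positions, aliases, and types are hypothetical examples to be replaced with your real schema:

-- Select individual columns from the CSV and cast them up front instead of
-- writing everything as varchar via select *.
CREATE TABLE Data_Transform.`/converted` AS
SELECT
    CAST(columns[0] AS INT)   AS user_id,
    columns[1]                AS user_name,
    CAST(columns[2] AS FLOAT) AS score,
    TO_TIMESTAMP(columns[3], 'yyyy-MM-dd HH:mm:ss') AS created_at
FROM Data_source.`/data_dump/*`;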
I've also seen the following recommendation from a Drill developer: split the files into smaller files manually to work around the limitations of the local file system, since Drill doesn't split files on block boundaries.

IBM BigSheets Issue

I am getting errors loading my files into BigSheets, both directly from HDFS (files that are the output of Pig scripts) and from raw data lying on the local hard disk.
I have observed that whenever I load the files and issue a row count to see whether all the data has been loaded into BigSheets, I see a lower number of rows than expected.
I have checked that the files are consistent and have proper delimiters (\t or comma-separated fields).
The size of my file is around 2 GB, and I have used either the *.csv or *.tsv format.
Also, in some cases when I have tried to load a file directly from Windows, the file sometimes loads successfully with a row count matching the actual number of lines in the data, and sometimes with a lower row count.
Sometimes a fresh file used for the first time gives the correct result, but if I do the same operation again, some rows are missing.
Kindly share your experience with BigSheets and any solutions to problems where the entire dataset is not being loaded. Thanks in advance.
The data that you originally load into BigSheets is only a subset. You have to run the sheet to get the results on the full dataset.
http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/t0057547.html?lang=en
