So I have the strangest thing happening with ClickHouse CSV import.
We have a CSV (TabSeparated) file (about 25 MB, 40k lines, many columns). When I load it into a local ClickHouse instance, it loads fine. When we try to load it into a 3-server cluster, we get the following error:
Code: 62. DB::Exception: Syntax error: failed at position 321 ('some_field') (line 4, col 264): some_field = 'AIR' FORMAT JSON. Expected one of: token, DoubleColon, MOD, DIV, NOT, BETWEEN, LIKE, ILIKE, NOT LIKE, NOT ILIKE, IN, NOT IN, GLOBAL IN, GLOBAL NOT IN, IS, AND, OR, QuestionMark, alias, AS, GROUP BY, WITH, HAVING, WINDOW, ORDER BY, LIMIT, OFFSET, SETTINGS, UNION, EXCEPT, INTERSECT, INTO OUTFILE, FORMAT, end of query.
Which just doesn't really make sense.
The command to load is the same for both environments:
cat file.tsv|docker run -i --rm clickhouse/clickhouse-server clickhouse-client --host=the.host --query="INSERT INTO default.mytable FORMAT TabSeparated"
What is even stranger is that if I split the input file into two parts, it loads into the cluster just fine.
(The table format is MergeTree / ReplicatedMergeTree.)
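For reference, the split workaround mentioned above looks roughly like this (the line count and part prefix are arbitrary; the clickhouse-client invocation is the same one as above):
split -l 20000 file.tsv part_
for f in part_*; do
  cat "$f" | docker run -i --rm clickhouse/clickhouse-server clickhouse-client --host=the.host --query="INSERT INTO default.mytable FORMAT TabSeparated"
done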
This is a relatively simple problem with (I'm hoping) a similarly-simple solution.
In my ADF ETLs, any time there's a known and expected yet unrecoverable row-based error, I don't want my full ETL to fail. Instead, I'd rather pipe those rows off to a log, which I can then pick up at the end of the ETL for manual inspection. To do this, I use conditional splits.
Most of the time, there shouldn't be any rows like this. When this is the case, I don't want my blob sink to write a file. However, the current behavior writes a file no matter what -- it's just that the file only contains the table header.
Is there a way to skip writing anything to a blob sink when there are no input rows?
Edit: Somehow I forgot to specify -- I'm specifically referring to a Mapping Data Flow with a blob sink.
You can use a Lookup activity (with "First row only" unchecked) to get all your table data first. Then use an If Condition to check the row count in the Lookup activity's output. If the count is > 0, execute the next activity (or data flow).
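For example, if the Lookup activity were named LookupRows (the name is just an assumption here), the If Condition expression could be something like:
@greater(activity('LookupRows').output.count, 0)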
I was experimenting with inserting data into a partitioned table. Irrespective of whether I use the keyword 'repartition' or 'partition', the data lands correctly in the respective partitions. So I am wondering whether the keywords shown below have any significance:
scala> input.repartition($"decade").registerTempTable("second_table")
spark.sql("insert into lakehuron partition(decade) select date,level,decade from second_table")
In the above code, I repeated the exercise twice: once removing the keyword 'repartition' in the first line, and again removing both 'repartition' in the first line and 'partition' in the second line. Both variants inserted the data correctly into the respective partitions, and I was able to see new files being generated within the right partition, e.g. /user/hive/warehouse/lakehuron/decade=1960.
Please help me understand the significance of these keywords.
I am trying to devise a crosstab report for Cognos. I am joining a main query to a dimension query so that all crosstabs will line up nicely with one another. When I ask to view the tabular data for either the main query or the dimension query, it displays fine. The main query has about 10,000 records and the dimension query 70 records.
However, when I run the report as normal, or view the tabular data for the crosstab query, I get an RQP-DEF-0177 error.
An error occurred while performing operation 'sqlOpenResult' status='-237'.
and the error details begin:
UDA-SOR-0005 Unable to write the file.RSV-SRV-0042 Trace back:RSReportService.cpp(724)
followed by a bunch of QFException clauses.
How could I fix this problem?
This is due to insufficient disk space for temporary files. Check IBM Cognos Configuration -> Environment -> Temporary files location.
When a report runs, a UDA file is created inside the temp folder, and the disk may not have enough space for the UDA file to expand. Increasing the disk space where the temp folder is located will resolve this issue.
This is a generic error that Cognos throws in many situations. If you want to dig into the details, go to /logs/cogserver.log; you will find more information there.
I am new to Hadoop and I am using a single node cluster (for development) to pull some data from a relational database.
Specifically, I am using Spark (version 1.4.1), via the Java API, to pull data for a query and write it to Hive. I have run into various problems (and have read the manuals and tried searching online), but I think I might be misunderstanding some fundamental part of this.
First, I thought I'd be able to read data into Spark, optionally run some Spark methods to manipulate the data and then write it to Hive through a HiveContext object. But, there doesn't seem to be any way to write straight from Spark to Hive. Is that true?
So I need an intermediate step. I have tried a few different methods of storing the data first before writing to Hive and settled on writing an HDFS text file, since that seemed to work best for me. However, when writing the HDFS file, I get square brackets in the output, like this: [A,B,C]
So, when I load the data into Hive using the "LOAD DATA INPATH..." HiveQL statement, I get the square brackets in the Hive table!!
What am I missing? Or, more appropriately, can someone please help me understand the steps I need to take to:
Run a SQL query on a SQL Server or Oracle DB
Write the data out to a Hive table that can be accessed by a dashboard tool.
My code right now, looks something like this:
DataFrame df = sqlContext.read().format("jdbc").options(getSqlContextOptions(driver, dburl, query)).load(); // This step seems to work fine.
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile(getHdfsUri() + pathToFile); // This works, but writes the rows in square brackets, like: [1, AAA].
hiveContext.sql("CREATE TABLE BLAH (MY_ID INT, MY_DESC STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE");
hiveContext.sql("LOAD DATA INPATH '" + getHdfsUri() + hdfsFile + "' OVERWRITE INTO TABLE `BLAH`"); // Get's written like:
MY_INT MY_DESC
------ -------
AAA]
The INT column doesn't get loaded at all, because the leading [ makes it no longer a numeric value, and the last column ends up with the ] from the end of each row in the HDFS file.
Please help me understand why this isn't working or what a better way would be. Thanks!
I am not locked into any specific approach, so all options would be appreciated.
OK, I figured out what I was doing wrong. I needed to create a DataFrame through the HiveContext and use its write function with the com.databricks.spark.csv format to write the table into Hive. This does not require an intermediate step of saving a file in HDFS, which is great, and it writes to Hive successfully.
DataFrame df = hiveContext.createDataFrame(rdd, struct);
df.select(cols).write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("TABLENAME");
I did need to create a StructType object, though, to pass into the createDataFrame method for the proper mapping of the data types (something like what is shown in the middle of this page: Support for User Defined Types for java in Spark). And the cols variable is an array of Column objects, which is really just an array of column names (i.e. something like Column[] cols = {new Column("COL1"), new Column("COL2")};).
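For reference, a rough sketch of that StructType and cols setup (using org.apache.spark.sql.types.DataTypes and org.apache.spark.sql.Column; the field names below just mirror the earlier BLAH example):
// Schema describing the two columns of the RDD's rows
StructType struct = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("MY_ID", DataTypes.IntegerType, true),
    DataTypes.createStructField("MY_DESC", DataTypes.StringType, true)
});
// Columns to select when writing out the DataFrame
Column[] cols = { new Column("MY_ID"), new Column("MY_DESC") };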
I think "Insert" is not yet supported.
http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
To get rid of the brackets in the text file, you should avoid saveAsTextFile. Instead, try writing the contents using the HDFS API, i.e. FileSystem.create and FSDataOutputStream.
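A minimal sketch of that approach, assuming the same two-column layout as above and reusing the getHdfsUri()/pathToFile helpers from the question (classes are org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem/Path/FSDataOutputStream, and java.net.URI):
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(getHdfsUri()), conf);
try (FSDataOutputStream out = fs.create(new Path(pathToFile))) {
    for (Row row : rdd.collect()) { // collect() is only reasonable for small result sets; use rdd.toLocalIterator() for larger ones
        out.writeBytes(row.getInt(0) + "," + row.getString(1) + "\n"); // plain delimited line, no brackets
    }
}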
I am trying to reformat over 600 GB of CSV files into Parquet using Apache Drill in a single-node setup.
I run my sql statement:
CREATE TABLE Data_Transform.`/` AS
....
FROM Data_source.`/data_dump/*`
and it is creating parquet files but I get the error:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR:
One or more nodes ran out of memory while executing the query.
is there a way around this?
Or is there an alternative way to do the conversion?
I don't know if querying all those GB on a local node is feasible. If you've configured the memory per the docs, using a cluster of Drillbits to share the load is the obvious solution, but I guess you already know that.
If you're willing to experiment, and you're converting the CSV files using a select * rather than selecting individual columns, change the query to something like select columns[0] as user_id, columns[1] as user_name. Cast columns to types like int, float, or datetime where possible. This avoids the overhead of storing everything as varchars and prepares the data for future queries that would otherwise need casts for analysis.
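For example (the target table, column names, and types here are made up for illustration), the CTAS could look something like:
CREATE TABLE Data_Transform.`converted` AS
SELECT CAST(columns[0] AS INT) AS user_id,
       columns[1] AS user_name,
       CAST(columns[2] AS DOUBLE) AS amount
FROM Data_source.`/data_dump/*`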
I've also seen the following recommendation from a Drill developer: split the files into smaller files manually to work around local file system limitations, since Drill doesn't split files on block boundaries.