I have configured spark-submit with
"--conf",
"spark.sql.autoBroadcastJoinThreshold=536870912", 512MB
But the DAG is still not broadcasting the smaller side of the join.
The code is a simple join. So I'm wondering what is wrong.
The input are files of parquet, stored on S3.
If more information is needed for further analyse, please let me know.
According to this blog,
BHJ is not supported for a full outer join. For right outer join, the only left side table can be broadcasted and for other left joins only right table can be broadcasted.
That is the reason the broadcast is not happening.
My guess would be that the configuration spark.sql.autoBroadcastJoinThreshold is overwritten somewhere or is not correctly set. You should check in the Spark UI the Environment tab if you find it and check if it's correctly set.
If you just need a quick fix, you can also force the broadcast with the hint .broadcast on the Dataset that you already know is small.
Related
I understood that clickhouse is eventually consistent. So once an insert call returns, it doesn't mean that the data will appear in a select query.
does that apply to stand-alone clickhouse (no distribution, no replication)?
I understand the concept of eventual consistency for data replication, but does it apply with distribution but no replication?
using a distributed+replicated clickhouse, what is a recommended way to know that some insert(s) can be safely looked up?
Basically I didn't find much information on this topic, so maybe I am not asking the best questions. Feel free to enlighten me.
No, but single-node setup shouldn't be considered reliable either.
By default yes, you'll insert to node the client is connected to (probably via some load balancer) and Distributed table will asynchronously forward each piece of data to node where it belongs. The insert_distributed_sync=1 setting will make the client wait synchronously.
On insert use ***MergeTree shard tables directly (not Distributed) with insert_quorum=2 setting (if there are 3 replicas) and retry infinitely with exactly same batch if there are some errors (can use different replicas on retry, since there's a deduplication based on batch hash). Then on reads use select_sequential_consistency=1 setting.
My team’s been thrown into the deep end and have been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individuals (and no matching identifiers) and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the result, deduplicate the entries via an external tool and then push this result into a database which is then queried for use in an Elasticsearch instance for the applications use.
So roughly speaking something like this:-
For examples sake the following data then exists in the result database from the first flow :-

Then running https://github.com/dedupeio/dedupe over this database table which will add cluster ids to aid the record linkage, e.g.:-

Second flow would then query the result database and feed this result into Elasticsearch instance for use by the applications API for querying which would use the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run on the merged content was pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here as the databases will be getting constantly updated which I'd need to handle, so really interested if anybody had solved a similar problem or used different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.
I'm working on a Hadoop / Cassandra integration I have a couple of questions I was hoping someone could help me with.
First, I seem to require the source table/cf to have been created with the option WITH COMPACT STORAGE otherwise I get an error can't read keyspace in map/reduce code.
I was wondering if this is just how it needs to be?
And if this is the case, my second question was, is it possible/how do I add the WITH COMPACT STORAGE option on to a pre-exsting table? .. or am I going to have to re-create them and move data around.
I am using Cassandra 1.2.6
thanks in advance
Gerry
I'm assuming you are using job.setInputFormatClass(ColumnFamilyInputFormat.class);
Instead, try using job.setInputFormatClass(CqlPagingInputFormat.class);
The Mapper input for this is Map<String, ByteBuffer>, Map<String,ByteBuffer>
Similarly, if you need to write out to Cassandra us CqlPagingOutputFormat and the appropriate output types.
See http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive for more info.
#Gerry
The "WITH COMPACT STORAGE" thing is CQL3 syntax to create tables structure compatible with Thrift clients and legacy column families.
Essentially, when using this option, the table, or should I say column family, is created without using any Composite.
You should know that CQL3 tables heavily rely on composites to work.
Now to answer your questions:
I was wondering if this is just how it needs to be?
Probably because your map/reduce code cannot deal with Composites. But I believe in version 1.2.6 of Cassandra, you have all the necessary code to deal with CQL3 tables. Look at classes in package org.apache.cassandra.hadoop.
is it possible/how do I add the WITH COMPACT STORAGE option on to a pre-exsting table?
No, it's not possible to modify/change table structure once created. You'll need some kind of migrations.
I am working on million Rows and columns in hbase 0.92.1. Now, I want to know how to create secondary index using Co-processor. Give some examples program for this.
Kindly give the program which supports hbase 0.92.1.
There's no single great way to do secondary indexing with HBase. The way you will approach the problem will be dictated by your data and your use case. Some good discussion of secondary indexing is located here
As far as I know, prior to 0.20, in Hbase API, you'd have HTableDescriptor which is writable still, so you can call HtabelDescriptor.addIndex() to create indices against the columns.
Example can be found here.
Then indexing starts moving to IHbase see the Jira story here.
To answer your question, in 0.92.1, I don't think there is anything out of box yet, you will have to write the coprocessor yourself, But There is a jira story for coprocessor secondary index you might want to watch the progress:)
meanwhile you can try idxColumnDescriptor here, also looking at the test TestIdxColumnDescriptor.java might help.
I'm working on a project with a friend that will utilize Hbase to store it's data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResult's when, in SQL land, I could write a simple query. Am I missing something? Or is Hbase missing something?
I think you, like many of us, are making the mistake of treating bigtable and HBase like just another RDBMS when it's actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. This means storing, ideally, many-to-one relationships within a single row, for example. Your queries should return very few rows but contain (potentially) many datapoints.
Perhaps if you told us more about what you were trying to store, we could help you design your schema to match the bigtable/HBase way of doing things.
For a good rundown of what HBase does differently than a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
If you want to access HBase using a query language and a JDBC driver it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make it a little easier to use.
I looked at Hadoop and Hbase and as Sean said, I soon realised it didn't give me what I actually wanted, which was a clustered JDBC compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC which seem more like what I was was after. (Personally, I haven't got farther with either of these other than reading the documentation so I can't tell which of them is any good, if any.)
I'd recommend taking a look at Apache Hive project, which is similar to HBase (in the sense that it's a distributed database) which implements a SQL-esque language.
Thanks for the reply Sean, and sorry for my late response. I often make the mistake of treating HBase like a RDBMS. So often in fact that I've had to re-write code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables. Which, in this case, is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.
Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by BigTable, which happens to be exposed by AppEngine as their db api, so browsing their docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled with via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out cloudera's pig tutorial and hive tutorial.
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
Regards,
David