Suggest technology/design for this big data use case - Hadoop

I am new to big data technologies and design, so I am looking for help from the Java world.
I have the concepts of tags and tagcombinations.
For example, U.S.A and Pen are two tags, and if they appear together in some definition, then we register a tagcombination (U.S.A-Pen) for it.
tags (U.S.A, Pen, Pencil, India, Shampoo)
tagcombinations (U.S.A-Pen, India-Pencil, U.S.A-Pencil, India-Pen, India-Pen-Shampoo)
Millions of tags.
Billions of tagcombinations.
One tagcombination generally has 2-8 tags.
Every day we get lakhs (hundreds of thousands) of new tagcombinations to write.
Daily we get crores (tens of millions) of queries to find matching combinations by a set of tags.
Queries need to support:
In how many tagcombination IDs does one tag, or a set of tags, appear?
If I query for Pen, India then it should return two tagcombinations (India-Pen, India-Pen-Shampoo). Queries will be fired by the application in real time.
Please suggest a solution that is distributed, has a Java client, and can handle the scale of data I am looking at.
I have already tried Cassandra but was not able to conclude that it is the right match for my problem.
Thanks
Naresh

I suggest you look into the Apache Lucene project:
http://lucene.apache.org/
You won't be able to use Cassandra directly for this, but if you store your data inside Cassandra, you can use Solr to add extra indexes on top of your data. DataStax has a bundled solution called DataStax Enterprise that packages Cassandra and Solr together:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
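
To make the Lucene suggestion concrete, here is a minimal sketch of indexing each tagcombination as one document with a multi-valued "tag" field and then querying for all combinations containing a set of tags. The index path, field names, and sample values are assumptions for illustration only, and error handling is omitted:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class TagComboIndex {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/tmp/tagcombos")); // hypothetical path

            // Index each tagcombination as one document; each tag is a separate
            // value of the multi-valued "tag" field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("id", "India-Pen-Shampoo", Field.Store.YES));
                for (String tag : new String[] {"India", "Pen", "Shampoo"}) {
                    doc.add(new StringField("tag", tag, Field.Store.NO));
                }
                writer.addDocument(doc);
            }

            // Query: which tagcombinations contain BOTH "Pen" and "India"?
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                BooleanQuery.Builder query = new BooleanQuery.Builder();
                query.add(new TermQuery(new Term("tag", "Pen")), BooleanClause.Occur.MUST);
                query.add(new TermQuery(new Term("tag", "India")), BooleanClause.Occur.MUST);
                TopDocs hits = searcher.search(query.build(), 10);
                System.out.println(hits.totalHits + " matching tagcombinations");
            }
        }
    }

Solr and Elasticsearch expose this same inverted-index model as a distributed service with Java clients, which lines up with the distributed, real-time requirement in the question.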

Related

Using Elasticsearch as permanent storage

Recently I have been working on a project that produces a huge amount of data every day. The project has two functionalities: one is storing data in HBase for future analysis, and the second is pushing data into Elasticsearch for monitoring.
As the data is huge, we would be storing it in two platforms (HBase and Elasticsearch)!
I have no experience with either of them. I want to know: is it possible to use Elasticsearch instead of HBase as persistent storage for future analytics?
I recommend reading this old but still valid article: https://www.elastic.co/blog/found-elasticsearch-as-nosql
Keep in mind that Elasticsearch is primarily a search engine. Whether it is enough depends on whether your data is critical or whether you can accept losing some of it, as with non-critical logs.
If you don't want to use an additional database for such a large volume of data, you can probably store it as files in something like HDFS, as sketched below.
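As a minimal sketch (the namenode address and file path are placeholders), writing a day's worth of raw events into HDFS with the standard Hadoop FileSystem API looks like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsArchiver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address

            // Write one day's raw events to a file; analytics jobs can read it later.
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/archive/events/2016-01-01.jsonl"))) {
                out.writeBytes("{\"event\":\"example\"}\n");
            }
        }
    }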
You should also check out Apache Phoenix (https://phoenix.apache.org/), which may provide the monitoring features you are looking for.

Apache NiFi - Federated Search

My team has been thrown into the deep end: we have been asked to build a federated search of customers over a variety of large datasets, which hold varying amounts of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache NiFi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool, and then push this result into a database, which is then queried and fed into an Elasticsearch instance for the application's use.
So, roughly speaking, something like this:
For example's sake, the following data then exists in the result database from the first flow:

Then https://github.com/dedupeio/dedupe is run over this database table, adding cluster IDs to aid the record linkage, e.g.:

The second flow would then query the result database and feed the result into the Elasticsearch instance for use by the application's API, which would use the cluster ID to link the duplicates.
A couple of questions:
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question: how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here, even though the databases will be getting constantly updated, which I'd need to handle; so I am really interested to hear whether anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing you would write a script for ExecuteScript, unless there is an equivalent Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> Elasticsearch.
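
If you do go the custom-processor route, a minimal skeleton might look like the following; the processor name and the external dedupe call are placeholders, and only the NiFi plumbing is the standard API:

    import java.util.Collections;
    import java.util.Set;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    public class TriggerDedupe extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("Batch handed off to the de-duplication step")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return; // nothing queued yet
            }
            // Placeholder: invoke the external dedupe step here, e.g. shell out to
            // the Python dedupe library or call a Java record-linkage routine.
            session.transfer(flowFile, REL_SUCCESS);
        }
    }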

Elasticsearch to index RDBMS data

These are three simple questions for which it was surprisingly hard to find definitive answers.
Does Elasticsearch support indexing data in RDBMS tables (Oracle/SQL Server/Informix) out of the box?
If yes, can you please point me to documentation on how to do it?
If not, what are alternative ways (plugins like Rivers, now deprecated) with a good reputation?
I'm surprised there isn't a solid answer for this yet. So here's the solution: Logstash directly gives us the ability to push data from an RDBMS into Elasticsearch.
Here's a link to a tutorial which tells you how to go about it. Briefly (all details in the first link), you simply need a JDBC driver for the relational database you'll be using (Postgres, MySQL, etc.) and a config file specifying your input as the relational database and your output as Elasticsearch. You can also specify a cron schedule which lets you keep the index updated at regular intervals.
Here's the article which mentions the configuration and gets you started (See Example 2): https://www.elastic.co/blog/logstash-jdbc-input-plugin
Here's the article which tells you how to configure the cron schedule: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#_scheduling
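
For reference, a minimal sketch of such a Logstash config; the driver jar, credentials, table, and schedule are all placeholders to adapt:

    input {
      jdbc {
        jdbc_driver_library => "/path/to/postgresql-jdbc.jar"   # placeholder driver jar
        jdbc_driver_class => "org.postgresql.Driver"
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
        jdbc_user => "etl_user"
        jdbc_password => "secret"
        statement => "SELECT id, name, updated_at FROM products"
        schedule => "*/5 * * * *"   # cron syntax: re-run every five minutes
      }
    }
    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "products"
        document_id => "%{id}"      # keeps Elasticsearch docs in sync with table rows
      }
    }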

Adding Advanced Search in ASP.NET MVC 3 / .NET

In a website I am working on, there is an advanced search form with several fields, some of them dynamic, showing or hiding depending on what is selected on the search form.
Data in the database is expected to be big, and records are spread over several tables in a very normalized fashion.
Is there a recommendation on using a third-party search engine, SQL Server full-text search, Lucene.NET, etc., rather than plain SELECT/JOIN queries?
Thank you
Thinking a little outside the box here -
Check out CSLA.NET; using this framework you can create business objects and "denormalise" your search algorithm.
Either way, be sure the database has proper indexes in place for better performance.
On the frontend you're going to need some JavaScript to map which top-level fields show sub-level fields. It's pretty straightforward.
For the actual search, I would recommend some flavor of Lucene.
You have the option of Lucene.NET, the .NET flavor that Stack Overflow uses; Solr, which is arguably easier to set up and get running than Lucene; or the newest kid on the block, Elasticsearch, which aims to be schema-free and horizontally scalable simply by dropping more instances into the cluster.
I have only used Solr myself, and it has a nice .NET client (SolrNet).
First, index the database fields that are important and heavily used.
For search, you are better off using full-text search; I tried it, and the results were very different from when I didn't use full text (a minimal example follows below).
It is also better to put your SELECT/JOIN queries in a stored procedure and call the stored procedure from your program.
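
For illustration, a minimal SQL Server full-text query; this assumes a full-text index already exists on the column, and the table and column names are hypothetical:

    -- LIKE '%pen%' cannot use a regular index; CONTAINS uses the full-text index.
    SELECT ProductId, Name
    FROM Products
    WHERE CONTAINS(Description, '"fountain pen"');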

HBase / Hadoop Query Help

I'm working on a project with a friend that will utilize HBase to store its data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResults when, in SQL land, I could write a simple query. Am I missing something? Or is HBase missing something?
I think you, like many of us, are making the mistake of treating Bigtable and HBase like just another RDBMS, when it's actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. This means, for example, ideally storing many-to-one relationships within a single row. Your queries should return very few rows but contain (potentially) many datapoints.
Perhaps if you told us more about what you are trying to store, we could help you design your schema to match the Bigtable/HBase way of doing things.
For a good rundown of what HBase does differently than a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
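
To illustrate the row-design point, here is a sketch using the current HBase client API (newer than the RowResult API the question mentions); the table, column family, and qualifiers are made up. A many-to-one relationship is stored inside a single row, and one Get reads it all back:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WideRowExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("customers"))) {

                // Many-to-one inside one row: each order becomes a column qualifier.
                Put put = new Put(Bytes.toBytes("customer-42"));
                put.addColumn(Bytes.toBytes("orders"), Bytes.toBytes("order-1001"), Bytes.toBytes("pen"));
                put.addColumn(Bytes.toBytes("orders"), Bytes.toBytes("order-1002"), Bytes.toBytes("pencil"));
                table.put(put);

                // One Get returns the customer and all of their orders; no join needed.
                Result row = table.get(new Get(Bytes.toBytes("customer-42")));
                row.getFamilyMap(Bytes.toBytes("orders")).forEach((qualifier, value) ->
                        System.out.println(Bytes.toString(qualifier) + " = " + Bytes.toString(value)));
            }
        }
    }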
If you want to access HBase using a query language and a JDBC driver it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make it a little easier to use.
I looked at Hadoop and HBase and, as Sean said, I soon realised they didn't give me what I actually wanted, which was a clustered JDBC-compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC, which seem more like what I was after. (Personally, I haven't gotten further with either of these than reading the documentation, so I can't tell which of them is any good, if either.)
I'd recommend taking a look at the Apache Hive project, which is similar to HBase (in the sense that it's a distributed database) and implements a SQL-esque language.
Thanks for the reply Sean, and sorry for my late response. I often make the mistake of treating HBase like an RDBMS. So often, in fact, that I've had to rewrite code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables. Which, in this case, is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.
Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by BigTable, which happens to be exposed by App Engine as their DB API, so browsing their docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out Cloudera's Pig and Hive tutorials.
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
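
As a sketch of that idea (the column family, qualifier, and predicate are made up for illustration), here is an HBase map/reduce mapper that only emits the rows it cares about:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class ActiveRowMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
                throws IOException, InterruptedException {
            byte[] status = columns.getValue(Bytes.toBytes("cf"), Bytes.toBytes("status"));
            // Filter at the data: emit only matching rows, skip everything else.
            if (status != null && "ACTIVE".equals(Bytes.toString(status))) {
                context.write(new Text(Bytes.toString(rowKey.get())), ONE);
            }
        }
    }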
Regards,
David