Use Case:
An application uses Spark to process data for 5 minutes; the data to be processed could amount to several hundred thousand records in the data store.
The choice for data storage is Elasticsearch.
Issue:
Is there a connector for Spark in Elasticsearch similar to the connector for MongoDB?
https://www.mongodb.com/products/spark-connector.
Investigation:
I spent a lot of time on this, but the best I could find was a solution using the search API with scroll (which fetches a limited number of records per request), and this does not fit my use case.
Please note that my Elasticsearch cluster will hold JSON data and we do not want to save RDDs, as mentioned below:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
You can use the Spark connector for Elasticsearch (elasticsearch-hadoop), and the data is not saved in any binary form: the RDD/DataFrame is serialized as JSON, and that is what goes into Elasticsearch.
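A minimal sketch of that round trip with the elasticsearch-spark connector (the index name my-index/docs and the ES endpoint are assumptions; adjust them to your cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // org.elasticsearch:elasticsearch-spark connector

object EsSparkSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-spark-sketch")
      .set("es.nodes", "localhost:9200") // assumed Elasticsearch endpoint

    val sc = new SparkContext(conf)

    // Write: each Map is serialized to a JSON document; no binary RDD is stored.
    val docs = sc.makeRDD(Seq(
      Map("user" -> "alice", "visits" -> 5),
      Map("user" -> "bob",   "visits" -> 3)
    ))
    docs.saveToEs("my-index/docs") // hypothetical index/type

    // Read: each hit comes back as a (documentId, Map of fields) pair.
    val hits = sc.esRDD("my-index/docs", "?q=visits:[3 TO *]")
    hits.collect().foreach(println)

    sc.stop()
  }
}
```

Because the connector streams documents straight into the index as JSON, there is no intermediate binary representation on the Elasticsearch side, which is what the question is worried about.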
Related
I have been working on ways to import raw Google Analytics data without having to use a premium account. So far this is the nearest link to what I want to do:
How to extract data from Google Analytics and build a data warehouse (webhouse) from it?
I want to load that data into Elasticsearch and display it using Kibana. What is the best ETL approach for this? Has anyone tried to display GA data using the ELK stack?
You should do it in two steps.
First, get the data. A very useful resource is https://developers.google.com/webmaster-tools/v3/how-tos/search_analytics, but you first need a Google Webmaster Tools account and OAuth credentials created at https://console.developers.google.com/apis.
Then, once you have your data, find a way to import it into Elasticsearch. I'm still looking for the best way to do so; one option is to transform the result table into CSV and then use the CSV filter (a configuration sketch follows): https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html
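A minimal Logstash pipeline sketch for that CSV route (the file path, column names, and index name are all assumptions for illustration):

```
input {
  file {
    path => "/tmp/ga_export.csv"   # assumed path to the exported CSV
    start_position => "beginning"
    sincedb_path => "/dev/null"    # re-read the file on every run (dev only)
  }
}
filter {
  csv {
    separator => ","
    columns => ["date", "page", "clicks", "impressions"]  # assumed GA columns
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "ga-data"             # hypothetical index name
  }
}
```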
Have a look at this:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-http_poller.html
You can use this to poll an endpoint, in this case GA, and load the response data into Elasticsearch. You may want to filter the response with the Split and/or Mutate plugins as well.
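A sketch of such a poller (the URL, schedule, field names, and index are placeholders; a real GA call would also need authentication):

```
input {
  http_poller {
    urls => {
      ga => "https://example.com/ga-endpoint"   # hypothetical endpoint
    }
    schedule => { cron => "*/20 * * * *" }      # poll every 20 minutes
    codec => "json"
  }
}
filter {
  split { field => "rows" }   # assumed field holding the result rows
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "ga-poller"      # hypothetical index name
  }
}
```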
I have done this same setup.
Extracted data from Google Analytics with 7 dimensions and 6 metrics, of which 2 dimensions were the primary key (Timestamp and ID). This was done using R.
Did some transformations on the data using Linux awk and sed commands.
Loaded the data into Apache Hive with row/column formatting, creating a total of 9 tables.
Joined all 9 tables in Hive using join queries on the 2 primary keys.
Used the elasticsearch-hadoop connector to load the final resulting table into Elasticsearch. Had to do some small data transformations to match Hive and Elasticsearch data types.
Used Kibana to visualize the data in Elasticsearch.
Now I am planning to avoid all the manual steps and somehow automate the steps above; a sketch of automating the Hive-to-Elasticsearch step follows.
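One way to automate that last load is a scheduled Spark job that reads the joined Hive table and writes it to Elasticsearch. A sketch, assuming Hive support is enabled in Spark and a final table named ga_final (both assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // elasticsearch-spark SQL support

object HiveToEs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-es")
      .enableHiveSupport()                  // read the joined Hive table directly
      .config("es.nodes", "localhost:9200") // assumed ES endpoint
      .getOrCreate()

    // ga_final is the hypothetical result of the 9-table join.
    val joined = spark.sql("SELECT * FROM ga_final")

    // es.mapping.id reuses a key column as the document id,
    // so re-running the job upserts instead of duplicating documents.
    joined.saveToEs("ga-analytics/docs", Map("es.mapping.id" -> "id")) // assumed id column

    spark.stop()
  }
}
```

Running this under cron or Oozie would replace the manual R/awk/Hive hand-offs one step at a time.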
I need to use SolrCloud as the search engine on top of HBase and HDFS for searching a very large number of documents.
Currently these docs are in different data sources. I am unsure whether Solr should search, index, and store these docs within itself, or whether Solr should just be used for indexing while the docs, along with their metadata, reside in the HBase/HDFS layer.
I have tried searching for how a Solr-HBase integration works best (meaning what should be done at the Solr level and what at the Hadoop level), but in vain. Has anyone done this kind of big-data search before and can give some pointers? Thanks
Solr provides fast search via its indexes; it uses inverted indexes for this. So you index documents to Solr and it creates the indexes. Based on how you have defined schema.xml, Solr decides how the indexes have to be created. The indexes and the field values can be stored in HDFS (based on your configuration in solrconfig.xml).
With respect to HBase, you can run your query directly on HBase without having to use Solr. SolrBase is an available Solr-HBase integration. Also have a look at Lily.
A good design is to search for things in Solr, quickly get the IDs of the matching records, and then, if needed, fetch the entire records from HBase. You need to make sure that the entire data set is in HBase and that only enough data to search on is indexed. Needless to say, Solr and HBase should be kept in sync. One ready-made framework for this is the NGDATA HBase indexer.
Solr works wonders for counts, grouped counts, and stats. Once you have those numbers and their IDs, HBase can take over: once you have the row key (the ID), HBase gives you low-latency lookups, which suits web applications well. A sketch of this search-then-fetch pattern follows.
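A minimal sketch of that pattern using SolrJ and the HBase client (the Solr collection docs, HBase table records, and column family d are assumptions):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

import scala.jdk.CollectionConverters._

object SearchThenFetch {
  def main(args: Array[String]): Unit = {
    // 1. Ask Solr only for matching document IDs: a small, index-backed response.
    val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()
    val query = new SolrQuery("title:spark").setFields("id").setRows(100)
    val ids = solr.query(query).getResults.asScala.map(_.getFieldValue("id").toString)

    // 2. Fetch the full records from HBase by row key.
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("records")) // assumed table
    for (id <- ids) {
      val result = table.get(new Get(Bytes.toBytes(id)))
      val body = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("body")))
      println(s"$id -> $body")
    }

    table.close(); conn.close(); solr.close()
  }
}
```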
I am new to Elasticsearch and want to index my website logs, which are stored on HDFS, for fast querying.
I have a well-structured pipeline which runs a script every 20 minutes to ingest the data into HDFS.
I want to integrate Elasticsearch with it, so that it also indexes these logs based on particular field(s), thereby giving faster query results via Spark SQL.
So, my question is: can I index my data based on particular field(s) only?
Also, my logs are saved in the Avro file format. Does ES provide a way to directly index Avro-serialized data, or do I need to convert it into some other format?
Thank you in advance.
I would suggest you look at the Elasticsearch, Logstash, and Kibana (ELK) stack, which should be good enough to fulfill your requirement. Putting the data on HDFS and then using ES would be additional overhead.
Instead, you can use Logstash to pump data into ES, index whatever fields you wish to query, and build dashboards in less than 10 minutes. Take a look at this tutorial for a step-by-step guide:
http://hadooptutorials.co.in/tutorials/elasticsearch/log-analytics-using-elasticsearch-logstash-kibana.html
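If you do keep the HDFS pipeline from the question, here is a sketch of indexing only selected fields from the Avro logs with Spark (the path, field names, and index are assumptions; ES does not ingest Avro directly, so the rows are deserialized by Spark and re-serialized as JSON on the way in):

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // elasticsearch-spark SQL support

object AvroLogsToEs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-logs-to-es")
      .config("es.nodes", "localhost:9200")   // assumed ES endpoint
      .getOrCreate()

    // Requires the spark-avro package (built into Spark 2.4+ as "avro").
    val logs = spark.read.format("avro").load("hdfs:///logs/*.avro") // assumed path

    // Index only the fields you want to query on; the full records stay on HDFS.
    logs.select("timestamp", "url", "status")  // assumed field names
      .saveToEs("weblogs/docs")                // hypothetical index/type

    spark.stop()
  }
}
```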
I am looking for a solution for how to sync Cassandra with Elasticsearch, i.e., whatever data is written to Cassandra should also be written to Elasticsearch.
One way would be to develop a capture process that runs on a schedule or in a loop and reads "new" data from Cassandra and sends it to Elasticsearch. The definition of "new" data depends on your Cassandra schema and your applications' insert/update/delete patterns.
This can be done either in real time or as a separate batch process.
As and when data arrives in Cassandra, index it into Elasticsearch based on your retrieval/query pattern.
Use a real-time processing layer (such as Storm) or a distributed message broker like Kafka to sync up the data.
Or periodically query the data from Cassandra and ingest it into ES as a separate batch job; a sketch of the batch option follows.
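A sketch of that batch job using the spark-cassandra-connector together with elasticsearch-spark (the keyspace app, table events, and key column event_id are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._ // elasticsearch-spark SQL support

object CassandraToEs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-to-es")
      .config("spark.cassandra.connection.host", "localhost") // assumed Cassandra host
      .config("es.nodes", "localhost:9200")                   // assumed ES endpoint
      .getOrCreate()

    // Read the Cassandra table via the DataStax spark-cassandra-connector.
    val events = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "app", "table" -> "events")) // assumed keyspace/table
      .load()

    // Reusing the Cassandra primary key as the ES document id makes the job
    // idempotent: re-running it upserts documents instead of duplicating them.
    events.saveToEs("events/docs", Map("es.mapping.id" -> "event_id")) // assumed key column

    spark.stop()
  }
}
```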
I went through the Couchbase XDCR replication documentation, but failed to understand the points below:
1. Couchbase replicates all the data in a bucket to Elasticsearch in batches, and Elasticsearch indexes these data for real-time statistics. My question is: if all the data is replicated to Elasticsearch, then Elasticsearch is acting like a database that can hold a huge amount of data. So can we replace Couchbase with Elasticsearch?
2. How is the data, in JSON form, sent to d3.js to display statistical graphs?
All of the data is replicated to Elasticsearch, but it is not held there by default: the indexes and such are created, but the documents are discarded. Elasticsearch is not a database and does not perform like one, certainly not on the level of Couchbase. Take a look at this presentation, which talks about performance and why Couchbase is still needed.
If your data are not critical or if you have another source of truth, you can use Elasticsearch only.
Otherwise, I'd keep Couchbase and Elasticsearch.
There is a resiliency page on the Elastic.co website that describes potential known problems: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
My 2 cents.