Does Memgraph act as a sink for streaming sources from Kafka and then, once the messages have been received, organize them into a graph database? If yes, how are these messages organized? I don't understand how the messages can be transformed from the format they have in a given topic into something that is understandable to a graph database such as Memgraph.
Memgraph consumes messages from Kafka with its internal consumer. This means you need to create a stream inside Memgraph and define which topic the data will be read from. Before doing that, you have to create and load a transformation module for that stream. A transformation module is a procedure that tells Memgraph how to transform Kafka messages into Cypher queries, which are then executed for each consumed message. Once you create a stream, you have to start it to begin ingesting data. Thanks to the transformation module, the data from your Kafka topic is seamlessly stored in the graph database.
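For illustration, a transformation module is just a Python procedure loaded into Memgraph. The sketch below follows the pattern from Memgraph's streaming documentation; the procedure name, node label, and the assumption that the payload is UTF-8 text are all made up for the example:

import mgp  # Memgraph's Python API, available to modules loaded into Memgraph


@mgp.transformation
def messages_to_nodes(messages: mgp.Messages
                      ) -> mgp.Record(query=str, parameters=mgp.Nullable[mgp.Map]):
    # For every consumed Kafka message, emit one Cypher query plus its parameters.
    result_queries = []
    for i in range(messages.total_messages()):
        message = messages.message_at(i)
        payload = message.payload().decode("utf-8")  # assumes a UTF-8 text payload
        result_queries.append(mgp.Record(
            query="CREATE (:Message {topic: $topic, payload: $payload})",
            parameters={"topic": message.topic_name(), "payload": payload}))
    return result_queries

After loading the module, the stream itself is created and started with Cypher along the lines of CREATE KAFKA STREAM my_stream TOPICS my_topic TRANSFORM my_module.messages_to_nodes; followed by START STREAM my_stream; (the stream, topic, and module names here are placeholders).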
I'm fairly new to Kafka Connect. I'm planning to use a Kafka Connect source to read data from my MySQL database tables into one of the Kafka topics. Since my source table is a transactional data store, I might get a new record inserted into it or an existing record updated. I'm trying to understand how I can achieve parallelism when reading data from this table, and my question is:
Can I use tasks.max to achieve parallelism (have more than one thread) to read the data and push it onto the Kafka topic? If yes, please explain.
Thanks
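For reference, the kind of source connector setup being described is usually registered with the Kafka Connect REST API. Below is only a sketch in Python; the connector class and property names are those of the Confluent JDBC source connector, and every hostname, table name, and column name is a placeholder rather than something from the question:

import requests

connector = {
    "name": "mysql-orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://mysql-host:3306/shop",
        "connection.user": "user",
        "connection.password": "secret",
        "table.whitelist": "orders",
        # tasks.max is the upper bound on how many tasks (threads) Connect may
        # start for this connector; the JDBC source splits work per table, so a
        # single table is still read by a single task.
        "tasks.max": "4",
        # timestamp+incrementing mode picks up both new and updated rows.
        "mode": "timestamp+incrementing",
        "incrementing.column.name": "id",
        "timestamp.column.name": "updated_at",
        "topic.prefix": "mysql-",
    },
}

# POST the connector definition to a Connect worker (default REST port 8083).
response = requests.post("http://localhost:8083/connectors", json=connector)
response.raise_for_status()
print(response.json())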
I have a table in HBase with a huge amount of data and I would like to export it in a text format to the local file system. Could anyone suggest how to do it?
I would also like to know whether I can export the HBase table to Hive or Pig.
Have you tried HBase Export (http://hbase.apache.org/0.94/book/ops_mgt.html)? It will dump the contents of an HBase table to HDFS, and from there you can use Pig and Hive to access it. I haven't tried this myself, but from the documentation it appears to address your issue.
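As a side note, and separate from the Export tool mentioned above: if a simple client-side dump to a local text file is all that's needed, a small script against the HBase Thrift server can do it. This is only a sketch; it assumes the happybase library, a running Thrift server, and placeholder host and table names:

import csv
import happybase  # requires the HBase Thrift server to be running

connection = happybase.Connection('hbase-thrift-host')  # placeholder host
table = connection.table('my_table')                    # placeholder table name

# Scan the whole table and write row key, column, value as tab-separated text.
with open('my_table.tsv', 'w', newline='') as out:
    writer = csv.writer(out, delimiter='\t')
    for row_key, columns in table.scan(batch_size=1000):
        for column, value in columns.items():
            writer.writerow([
                row_key.decode('utf-8'),
                column.decode('utf-8'),                  # e.g. cf:qualifier
                value.decode('utf-8', errors='replace'),
            ])

connection.close()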
I am trying to build a data integration process using an ETL tool (Talend).
The challenge I am facing is bringing data from various sources (in different formats) into a single format.
The sources may have different column names and structures (order, datatypes, etc.), so different metadata.
As I see it, this is a very common case, but the tool does not seem able to handle it, as it does not provide any dynamic mapping feature.
What is the best approach to handle such a scenario?
Talend does provide a dynamic mapping tool. It's called tMap, or tXmlMap for XML data.
There's also tHMap (the Hierarchical Mapping tool), which is a lot more powerful, but I've yet to use it because it's very raw in the version of Talend I'm using (5.4); it should be more usable in 5.5.
Your best approach here may be to use a tMap after each of your components to standardise the schema of your data.
First, pick what the output schema should look like (this could be the same as one of your current schemas, or something entirely different if necessary), then simply copy and paste that schema into the output table of each tMap and map the relevant data across.
An example job may look something like this:
Where the schemas and the contained data for each "file" (I'm using tFixedFlowInput components to hard-code the data in the job rather than reading a file, but the premise is the same) are as follows:
File 1:
File 2:
File 3:
And they are then mapped to match the schema of the first "file":
File 1:
File 2:
File 3:
Notice how the first tMap configuration shows no changes as we are keeping the schema exactly the same.
Now that our inputs all share the same schema, we can use a tUnite component to union the data (much like SQL's UNION operator).
After this, we take one final step and use a tReplace component so that we can easily standardise the "sex" field to M or F:
Lastly, I output the result to the console, but it could go to any available output component.
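If it helps to see the same standardise-then-union-then-replace flow outside Talend, here is a rough pandas equivalent; the file names, column names, and the M/F mapping are invented for illustration and are not taken from the job above:

import pandas as pd

# Hypothetical inputs whose columns differ in name and order.
file1 = pd.read_csv("file1.csv")  # already has columns: name, age, sex
file2 = pd.read_csv("file2.csv").rename(
    columns={"full_name": "name", "years": "age", "gender": "sex"})
file3 = pd.read_csv("file3.csv").rename(
    columns={"person": "name", "sex_code": "sex"})

# "tMap" step: force every input onto the same output schema.
schema = ["name", "age", "sex"]
frames = [df[schema] for df in (file1, file2, file3)]

# "tUnite" step: union the rows, much like SQL's UNION.
combined = pd.concat(frames, ignore_index=True)

# "tReplace" step: standardise the sex field to M or F.
combined["sex"] = combined["sex"].replace(
    {"male": "M", "Male": "M", "female": "F", "Female": "F"})

# "Output" step: print to the console (a stand-in for any output component).
print(combined)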
For a truly dynamic option, without having to predefine the mapping, you would need to read in all of your data with a dynamic schema. You could then parse the structure out into a defined output.
In this case you could read the data from your files in as a dynamic schema (a single column) and then drop it straight into a temporary database table. Talend will automatically create the columns as per the headers in the original file.
From here you could then use a transformation mapping file and the database's data dictionary to extract the data from the source columns and map it directly to the output columns.
Does Hadoop Streaming support the new columnar storage formats like ORC and Parquet, or are there frameworks on top of Hadoop that allow you to read such formats?
You can use HCatalog to read ORC files: https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat
It provides an abstraction for reading ORC, Text, Sequence, and RC files. I am not sure whether Parquet is supported there. If that doesn't work for you, you can use the ORC record readers in the Hive code base (OrcInputFormat, OrcOutputFormat) to read ORC files.
Rather old news, but I struggled with this some time ago. I did not find any solution, so I made a set of input/output formats that convert Avro and Parquet files to/from plain text and JSON. It can be found at http://github.com/whale2/iow-hadoop-streaming. There's no ORC support, but Avro and Parquet are supported.
Hope this helps.
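To show what the streaming side looks like once an input format such as the one above hands your job one JSON record per line, a mapper is just a script reading stdin. A minimal sketch follows; the field names are invented for illustration:

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: expects one JSON record per line on stdin
# (e.g. Parquet/Avro rows already converted to JSON by the input format).
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    key = record.get("user_id", "unknown")   # invented field name
    value = record.get("bytes", 0)           # invented field name
    # Streaming expects tab-separated key/value pairs on stdout.
    print("%s\t%s" % (key, value))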
Does anyone know how to do Apache Hive data visualization using D3.js?
Use a Hive query like:
SELECT b.blogID, count(b.name) FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b
AS blogID, name GROUP BY b.blogID;
using json_tuple to extract the fields; then you can easily feed that JSON to D3.js.
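One common way to wire this up is a small script that runs the Hive query and writes a JSON file which the D3.js page then loads with d3.json(). Below is only a sketch using PyHive; PyHive itself, the HiveServer2 host, port, and username are assumptions, and any Hive client would do:

import json
from pyhive import hive  # assumes a reachable HiveServer2 instance

conn = hive.Connection(host="hive-server", port=10000, username="hadoop")
cursor = conn.cursor()
cursor.execute(
    "SELECT b.blogID, count(b.name) "
    "FROM comments a "
    "LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b AS blogID, name "
    "GROUP BY b.blogID"
)

# Turn the result set into a list of objects that d3.json() can consume.
rows = [{"blogID": blog_id, "count": count} for blog_id, count in cursor.fetchall()]
with open("comments_per_blog.json", "w") as out:
    json.dump(rows, out)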
Try it using Sqoop:
use Sqoop to transfer the data from Hive to MySQL or any other RDBMS, then use Python to retrieve the data into a JSON file,
and finally create the D3.js visualization from the JSON file.