I need to replicate, in a new database, the result of a view.
Is there a solution other than using an external process waiting for changes?
Below is a broken example of what I need: a continuous filtered replica using a reduced view as the source:
curl -H "Content-Type: application/json" -X POST -d \
'{"source":"http://localhost:5984/datastream/_design/dbname/_view/viewname?group=true&group_level=3", \
"target":"http://localhost:5984/dbreplica", "filter":"dbname/filtername", \
"query_params": {"key":"value"}, "continuous":true}' http://localhost:5984/_replicate
CouchDB supports replicating from one database to another. It is not possible to replicate or copy a view to a database.
However, BigCouch, a custom CouchDB build made by Cloudant with built-in clustering capability, has a feature called chained map-reduce views. It allows the rows from a map-reduce view to be copied to another database, which is exactly what you are trying to do.
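On stock CouchDB, the usual workaround is exactly the external process you were hoping to avoid: query the reduced view over HTTP and write its rows into the target database. A minimal sketch in Python, assuming the requests library and the names from your example (the _id scheme is just an illustration):

import requests

# Source view and target database taken from the question above.
SOURCE_VIEW = "http://localhost:5984/datastream/_design/dbname/_view/viewname"
TARGET_DB = "http://localhost:5984/dbreplica"

# Fetch the reduced rows (group_level=3, as in the example).
rows = requests.get(SOURCE_VIEW,
                    params={"group": "true", "group_level": "3"}).json()["rows"]

# Turn each row into a document. Using the view key as _id is only an
# illustration; re-runs would need the current _rev to update existing docs.
docs = [{"_id": ":".join(map(str, row["key"])), "value": row["value"]}
        for row in rows]

# Bulk-insert into the target database.
requests.post(TARGET_DB + "/_bulk_docs", json={"docs": docs}).raise_for_status()

To make it continuous, you would run this on a schedule or trigger it from the source database's _changes feed.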
We are currently using SQL Server in AWS. We are looking at ways to create a data warehouse from that data in SQL Server.
It seems like the easiest way is to use the AWS DMS tool and send the data to Redshift, having it constantly sync. Redshift is pretty expensive, so we are looking at other ways of doing it.
I have been working with EMR. Currently I am using Sqoop to take data from SQL Server and put it into Hive. I am currently using the HDFS volume to store data. I have not used S3 for that yet.
Our database has many tables with millions of rows in each.
What is the best way to update this data every day? Does Sqoop support updating data? If not, what other tool is used for something like this?
Any help would be great.
My suggestion: go for Hadoop clusters (EMR) if the processing is too complex or time-consuming; otherwise it is better to use Redshift.
Choose the right tool. If it is for a data warehouse, then go for Redshift.
And why DMS? Are you going to sync in real time? You want a daily sync, so there is no need for DMS.
Better solution:
1. Make sure you have a primary key column and a column that tells us when the row was last updated, like updated_at or modified_at.
2. Run BCP to export the data in bulk from SQL Server to CSV files.
3. Upload the CSV files to S3, then import them into Redshift.
4. Use Glue to fetch the incremental data (based on the primary key column and the updated_at column) and export it to S3.
5. Import the files from S3 into Redshift staging tables.
6. Run an upsert (update + insert) to merge the staging table into the main table (a minimal sketch follows at the end of this answer).
If you feel that running Glue is a bit expensive, then use SSIS or a PowerShell script for steps 1 to 4, and the psql command to import the files from S3 into Redshift for steps 5 and 6.
This will handle the inserts and updates in your SQL Server tables, but deletes will not be part of it. If you need all CRUD operations, then go for the CDC method with DMS or Debezium, and push the changes to S3 and Redshift.
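To make step 6 concrete, here is a minimal sketch of the staging-table merge, assuming a psycopg2 connection to Redshift and hypothetical table and column names (my_table, my_table_staging, id):

import psycopg2

# Hypothetical Redshift endpoint and credentials; adapt to your cluster.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="etl_user", password="...")

with conn, conn.cursor() as cur:
    # Redshift has no native UPSERT: delete rows that have a newer version
    # in staging, then insert everything from staging into the main table.
    cur.execute("""
        DELETE FROM my_table
        USING my_table_staging s
        WHERE my_table.id = s.id;
    """)
    cur.execute("INSERT INTO my_table SELECT * FROM my_table_staging;")
    cur.execute("TRUNCATE my_table_staging;")

The with block commits the whole merge as one transaction on success and rolls it back on error.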
I'm trying to upload a CSV file grabbed from an SFTP server to Vertica as a new table. I got the GetSFTP processor configured, but I can't seem to understand how to set up the connection with Vertica and execute the SQL.
1 - You need to set up a DBCPConnectionPool with your Vertica JAR(s), like #mattyb mentioned.
2 - Create a staging area where you will keep your executable (COPY) scripts.
3 - Create a template to manage your scripts or loads (ReplaceText processor).
Note:
The parameters you see here arrive in the flow file from upstream processors.
This is a reusable process group, so there are many other PGs that will have their output sent to it.
Example:
A data_feed task will run a Start Data Feed PG (this PG holds its own parameters and values); if it executes with no errors it comes to this step, and if it fails it goes to another reusable PG that handles errors.
A daily ingest process (trickle load every 5 minutes) has a PG that prepares the CSV file, moves it to staging, and makes sure everything is in the right format; again, if it executes with no errors it comes to this step, and if it fails it goes to the error-handling PG.
And so on; many PGs will use this as a reusable PG to load data into the database.
(PG stands for process group.)
This is how mine looks:
. /home/dbadmin/.profile
/opt/vertica/bin/vsql -U $username -w $password -d analytics -c "
  copy ${TableSchema}.${TableToLoad} FROM '${folder}/*.csv'
    delimiter '|' enclosed by '~' null as ' '
    STREAM NAME '${TableToLoad} ${TaskType}'
    REJECTED DATA AS TABLE ${TableSchema}.${TableToLoad}_Rejects;
  select analyze_statistics('${TableSchema}.${TableToLoad}');"
- You can add your own params as well, or create new ones.
4 - UpdateAttribute processor so you can name the executable.
5 - PutFile processor that will place the Vertica load script on the machine.
6 - ExecuteStreamCommand - this will run the shell script.
- Audit logs and any other stuff can be done here.
Even better - see the linked template of a reusable PG I use for my data loads into Vertica with NiFi:
http://www.aodba.com/bulk-load-data-vertica-apache-nifi/
As for the Vertica DBCPConnectionPool, the setup should look like this (in the screenshot, the red mark is your ipaddress:port/dbadmin):
Note:
This DBCPConnectionPool can be at the project level (inside a PG) or at the NiFi level (create it in the main canvas using the Controller Services menu).
Besides the bulk loader idea from #sKwa, you can also create a DBCPConnectionPool with your Vertica JAR(s) and a PutSQL processor that will execute the SQL. If you need to convert from data to SQL you can use ConvertJSONToSQL; otherwise use PutDatabaseRecord, which is basically "ConvertXToSQL -> PutSQL" combined.
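Outside NiFi, the same COPY statement can be issued from any Vertica client. A minimal sketch with the vertica_python driver (not part of the NiFi flow; connection details, schema, table, and path are all hypothetical, mirroring the shell script above):

import vertica_python

# Hypothetical connection details; adjust for your cluster.
conn_info = {"host": "127.0.0.1", "port": 5433, "user": "dbadmin",
             "password": "...", "database": "analytics"}

copy_sql = """
    COPY myschema.mytable FROM '/staging/mytable/*.csv'
    DELIMITER '|' ENCLOSED BY '~' NULL AS ' '
    STREAM NAME 'mytable daily_load'
    REJECTED DATA AS TABLE myschema.mytable_Rejects
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(copy_sql)                                          # bulk load the CSVs
    cur.execute("SELECT analyze_statistics('myschema.mytable')")   # refresh statistics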
Let's say I want to build a Spark application that I want to be able to kill part way through. I still want to persist the data from the partitions that finish successfully. I attempt to do so by inserting it into a Hive table. In (PySpark) pseudocode:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def myExpensiveProcess(x):
    ...

udfDoExpensiveThing = udf(myExpensiveProcess, StringType())

myDataFrame \
    .repartition(100) \
    .withColumn("HardEarnedContent", udfDoExpensiveThing("InputColumn")) \
    .write.insertInto("SomeExistingHiveTable")
I run this until 30 partitions are done, then I kill the job. When I check SomeExistingHiveTable, I see that it has no new rows.
How do I persist the data that finishes, irrespective of which ones did not?
This is expected and desired behavior, ensuring consistency of the output.
Write the data directly to the file system, bypassing Spark's data source API:
myDataFrame \
.repartition(100) \
.withColumn("HardEarnedContent", udfDoExpensiveThing("InputColumn")) \
.rdd \
.foreachPartition(write_to_storage)
where write_to_storage implements the required logic, for example using one of the HDFS interfaces.
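For example, a minimal sketch of write_to_storage that simply dumps each partition as a JSON-lines file to a hypothetical directory visible to the executors (a real HDFS client call would slot in at the same place):

import json
import os
import uuid

OUTPUT_DIR = "/data/hard_earned_content"  # hypothetical destination

def write_to_storage(rows):
    # One uniquely named file per partition, so partitions never clobber each other.
    path = os.path.join(OUTPUT_DIR, "part-" + uuid.uuid4().hex + ".jsonl")
    with open(path, "w") as out:
        for row in rows:
            # row is a pyspark Row; assumes its values are JSON-serializable.
            out.write(json.dumps(row.asDict()) + "\n")

Every file written before the job is killed survives, which is exactly the property the Hive insert cannot give you.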
I am a new Hadoop developer and I have been able to install and run Hadoop services in a single-node cluster. The problem comes with data visualization. What purpose does the MapReduce jar file serve when I need to use a data visualization tool like Tableau? I have a structured data source to which I need to add a layer of logic so that the data makes sense during visualization. Do I need to write MapReduce programs if I am going to visualize with other tools? Please shed some light on how I could go about this.
This probably depends on what distribution of Hadoop you are using and which tools are present. It also depends on the actual data preparation task.
If you don't want to actually write map-reduce or spark code yourself you could try SQL-like queries using Hive (which translates to map-reduce) or the even faster Impala. Using SQL you can create tabular data (hive tables) which can easily be consumed. Tableau has connectors for both of them that automatically translate your tableau configurations/requests to Hive/Impala. I would recommend connecting with Impala because of its speed.
If you need to do work that requires more programming or where SQL just isn't enough you could try Pig. Pig is a high level scripting language that compiles to map-reduce code. You can try all of the above in their respective editor in Hue or from CLI.
If you feel like all of the above still don't fit your use case I would suggest writing map-reduce or spark code. Spark does not need to be written in Java only and has the advantage of being generally faster.
Most tools can integrate with hive tables meaning you don't need to rewrite code. If a tool does not provide this you can make CSV extracts from the hive tables or you can keep the tables stored as CSV/TSV. You can then import these files in your visualization tool.
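If your visualization tool cannot talk to Hive directly, pulling a table out over HiveServer2 and handing it over as CSV is only a few lines. A minimal sketch, assuming PyHive and pandas are installed and HiveServer2 runs on a hypothetical host and port:

import pandas as pd
from pyhive import hive

# Hypothetical HiveServer2 endpoint and table; adjust for your cluster.
conn = hive.Connection(host="hadoop-master.example.com", port=10000,
                       username="analyst", database="default")

# Aggregate in Hive so only the small result set crosses the wire.
df = pd.read_sql("SELECT region, SUM(sales) AS total_sales "
                 "FROM sales_facts GROUP BY region", conn)

df.to_csv("sales_by_region.csv", index=False)  # import this file into Tableau etc.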
The existing answer already touches on this but is a bit broad, so I decided to focus on the key part:
Typical steps for data visualisation
Do the complex calculations using any hadoop tool that you like
Offer the output in a (hive) table
Pull the data into the memory of the visualisation tool (e.g. Tableau), for instance using JDBC
If the data is too big to be pulled into memory, you could pull it into a normal SQL database instead and work on that directly from your visualisation tool. (If you work directly on hive, you will go crazy as the simplest queries take 30+ seconds.)
In case it is not possible/desirable to connect your visualisation tool for some reason, the workaround would be to dump output files, for instance as CSV, and then load these into the visualisation tool.
Check out some end-to-end solutions for data visualization.
For example, Metatron Discovery uses Druid as its OLAP engine, so you just link your Hadoop cluster with Druid and then you can manage and visualize your Hadoop data accordingly. It is open source, so you can also see the code inside it.
I have a very very large quantity of data in CouchDB, but I have very recently found out how crippled the mapreduce functions in couch are (no chaining).
So I had this idea of running map reduce queries from the CouchDB database using Hadoop, and hopefully storing the final result in another CouchDB database?
Is this too crazy? I know I can set up Hbase to do this, but I do not want to migrate my data from CouchDB to Hbase. And I love couch as a data store.
Apparently CouchDB is supposed to be able to stream data to Hadoop via Sqoop, but I didn't see any other information than that link. Worst case, you can write your own input reader to read from CouchDB, or export your data regularly and throw it onto HDFS and run it from there.
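If you go the export route, the dump itself is straightforward. A minimal sketch that pages through _all_docs with the requests library and writes JSON lines ready to be pushed into HDFS (the database URL and batch size are assumptions):

import json
import requests

DB_URL = "http://localhost:5984/datastream"  # hypothetical database
BATCH = 1000

with open("datastream_dump.jsonl", "w") as out:
    skip = 0
    while True:
        resp = requests.get(DB_URL + "/_all_docs",
                            params={"include_docs": "true",
                                    "limit": BATCH, "skip": skip})
        rows = resp.json()["rows"]
        if not rows:
            break
        for row in rows:
            out.write(json.dumps(row["doc"]) + "\n")
        # Note: skip-based paging is simple but slow on very large databases;
        # key-based paging with startkey scales better.
        skip += len(rows)

The resulting file can then be copied onto HDFS and fed to your Hadoop jobs.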
The MapReduce functions in CouchDB are constrained to simplify caching of the results. Rather than having to search for views that are impacted by a change, views were designed to be self-contained.
This means that if you have complex MapReduce code, you can use a tool like CouchApp to embed shared functions within a MapReduce function. I'm having trouble finding the reference for this, but you can use the !code macro to embed JavaScript functions in views; see "Using require() or // !json, !code in CouchDB?".
This could help to get some of the productivity benefit of chaining without chaining, by putting most of the code in shared functions, and merely calling the function in the different views. For the performance benefit of chaining, if that's what you're after, you may be better off just moving to HBase.