IoT Analytics dataset contains only __dt. Is my query broken? - aws-iot-analytics

I've been beating my head against this for a while. I have created a channel, pipeline, datastore and dataset, but the dataset just contains __dt no matter what I do.
I believe the channel, pipeline, and datastore are working, primarily because I see correctly formatted JSON messages in the S3 bucket for the datastore.
My datastore is called "salt_datastore". When I navigate to the relevant S3 bucket, I see a folder called "salt_datastore", and in it, I see a folder with today's date called "__dt=2022-10-09 00:00:00/". Inside that folder, I see a separate .gz file for every message I have sent, with names of the format "1665276480000_1665276510000_435011638936_salt_sensor_0_840.0.salt_sensor_pipeline.json.gz". If I download and open one of these, I see the MQTT messages that were sent to the MQTT topic.
So I think the channel, pipeline, and datastore are working, but if I set up a dataset with the query "select * from salt_datastore", I only get "__dt". I feel like this is the starting text of the folder inside the salt_datastore S3 bucket, but I can't figure out how to construct a valid SQL query that gives me what's inside that folder. Any help?

I had the same issue.
DELETED INCORRECT ANSWER
Edited
Previously I incorrectly said to put the name of the datastore in single quotes in the sql query.
But it was the name of the topic that I had to put in single quotes. This is actually in your sql statement in the IoT Core rule
NOT
the sql statement for the datastore.
So the sql for the topic should look like
SELECT * FROM 'some_topic'
That is, the topic that that you are publishing to.
Adding query images
Topic query
Dataset query

Related

Data integration for Magento to Quick Book

I'm currently new to Talend and I'm learning through videos and documentation, so I'm just not sure how to approach/implement this with best practices.
Goal
Integrate Magento and Quick Book using Talend.
My thoughts
Initially my first thought was I will setup direct DB connection for Magento and will take relevant data which I need and will process it and will send to QuickBook using REST API's(specifically bulk API's in batch)
But then again I thought it would be little hectic for me to query Magento database(multiple joins) so I've another option to use Magento's REST API.
But as I'm not much familiar with the tool I'm struggling little to find best suitable approach, so any help is appreciated.
What I've done till now?
I've saved my auth(for QB) and db(Magento) credentials data in file and using tFileInputDelimited and tContextLoad, I'm storing them in context variables so they can be accessible globally.
I've successfully configured database connection and dbinput but I've not used metadata for connection(should I use that and if Yes how can I pass dynamic values there?). I've used my context variables data in db connection settings.
I've taken relevant fields for now but if I want multiple fields simple query is not enough as Magento stores data in multiple tables for Customer etc but it's not big deal I know but I think it might increase my work.
For now that's what I've built and my next step is send the data to QB using REST while getting access_token and saving it to context variable and again storing the QB reference into Magento DB.
Also I've decided to use QB bulk API's but I'm not sure how I can process data in chunks in Talend(I tried to check multiple resources but no luck) i.e. if the Magento is returning 500 rows I want to process them in chunks of 30 as QB batch max limit is 30, so I will be sending it using REST to QB and as I said I also want to store back QB reference ID in magento(so I can update it later).
Also this all will be on local, then how can I do same in production? how I can maintain development and production environment?
Resources I'm referring
For REST and Auth best practices - https://community.talend.com/t5/How-Tos-and-Best-Practices/Using-OAuth-2-0-with-Talend-to-Access-Goo...
Nice example for batch processing here:
https://community.talend.com/t5/Design-and-Development/Batch-processing-in-talend-job/td-p/51952
Redirect your input to a tFileOutputDelimited.
Enter the output filename, tick the option "Split output in several files" from the "Advanced settings" and enter the value of 1000 into the field "Rows in each output file". This will create n files based on the filename with 1000 in each.
On the next subjob, use a tFileList to iterate over this file list to get records from each file.

Enumerate all files in a container and copy each file in a foreach

Need to load all .csv files in Azure Blob Container into SQL database.
Tried using a wild card *.* on the filename in the dataset which uses the linked service that connects to the blob and outputting the itemName in the Get Meta Data activity.
When executing in debug a list of filenames is not returned in the Output window. When referencing the parameter with an expression it is stated that the type is String not collection.
For this kind of task, I use the Get Metadata activity and process the results with a For Each activity. Inside the For Each activity, you can have a simple Copy activity to copy the CSV files to SQL tables, or if the work is more complex you can use Data Flow.
Some useful tips:
In the Get Metadata activity, under the DataSet tab > "Field list", select the "Child Items" option:
I recommend adding a "Filter" activity after the Get Metadata to ensure that you are only processing files, and optionally even expected extensions. You do this in the Settings tab like so:
In the For Each activity, on the Settings tab, set the Items based on the output of the Filter activity:
Inside the For Each, at the activity level, you reference the instance by "#item().name":
Here's what one of my production pipelines that implements this pattern looks like:
For you needs, you could get an idea of LookUp Activity.
Lookup activity can retrieve a dataset from any of the Azure Data
Factory-supported data sources. Use it in the following scenario:
Dynamically determine which objects to operate on in a subsequent
activity, instead of hard coding the object name. Some object examples
are files and tables.
For example, my container has 2 csv files in test container:
Configure a blob storage dataset :
Configure Lookup Activity:
Output:
Are you actually using *.*? That may be too vague for the system to interpret. Maybe you can try something a little more descriptive, like this, ???20190101.json, or whatever matches the patters of your data-sets.
I encountered a weird problem a couple weeks ago, whereby I was using the wildcard character to iterate through a bunch of files. I was always starting on row 3 and as luck would have it, some files didn't have a row 3. A handful of files had metadata in row 1, field names in row 2, and no row 3. So, I was getting some weird errors. I changed the line start to be row 2 and everything worked fine after that.
Also, check out the link below.
https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/

How to parameterise the data connection in Tableau in AWS (cloudformation or otherwise)?

I have a simple web app UI (which stores certain dataset parameters (for simplicity, assuming they are all data tables in a single Redshift database, but the schema/table name can vary, and the Redshift is in AWS). Tableau is installed on an EC2 instance in the same AWS account.
I am trying to determine an automated way of passing 'parameters' as a data source (i.e. within the connection string inside Tableau on EC2/AWS) rather than manually creating data source connections and inputting the various customer requests.
The flow for the user would be say 50 users select various parameters on the UI (for simplicity suppose the parameters are stored as a JSON file in AWS) -> parameters are sent to Tableau and data sources created -> connection is established within Tableau without the customer 'seeing' anything in the back end -> customer is able to play with the data in Tableau and create tables and charts accordingly.
How may I do this at least through a batch job or cloud formation setup? A "hacky" solution is fine.
Bonus: if the above is doable in real-time across multiple users that would be awesome.
** I am open to using other dashboard UI tools which solve this problem e.g. QuickSight **
After installing Tableau on EC2 I am facing issues in finding an article/documentation of how to pass parameters into the connection string itself and/or even parameterise manually.
An example could be customer1 selects "public_schema.dataset_currentdata" and "public_scema.dataset_yesterday" and one customer selects "other_schema.dataser_currentdata" all of which exist in a single database.
3 data sources should be generated (one for each above) but only the data sources selected should be open to the customer that selected it i.e. customer2 should only see the connection for other_schema.dataset_currentdata.
One hack I was thinking is to spin up a cloud formation template with Tableau installed for a customer when they make a request, creating the connection accordingly, and when they are done then just delete the cloud formation template. I am mainly unsure how I would get the connection established though i.e. pass in the parameters. I am not sure spinning up 50 EC2's though is wise. :D
An issue I have seen so far is creating a manual extract limits the number of rows. Therefore I think I need a live connection per customer request. Hence I am trying to get around this issue.
You can do this with a combination of a basic embed and applying filters. This would load the Tableau workbook. Then you would apply a filter based on whatever values your user selects from the JSON.
The final missing part is that you would use a parameter instead of a filter and pass those values to the database via initial sql.

Apache Nifi - Federated Search

My team’s been thrown into the deep end and have been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individuals (and no matching identifiers) and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the result, deduplicate the entries via an external tool and then push this result into a database which is then queried for use in an Elasticsearch instance for the applications use.
So roughly speaking something like this:-
For examples sake the following data then exists in the result database from the first flow :-

Then running https://github.com/dedupeio/dedupe over this database table which will add cluster ids to aid the record linkage, e.g.:-

Second flow would then query the result database and feed this result into Elasticsearch instance for use by the applications API for querying which would use the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run on the merged content was pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here as the databases will be getting constantly updated which I'd need to handle, so really interested if anybody had solved a similar problem or used different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

Realistic Data Backup method for Parse.com

We are building an iOS app with Parse.com, but still can't figure out the right way to backup data efficiently.
As a premise, we have and will have a LOT of data store rows.
Say we have a class with 1million rows, assume we have it backed up, then want to bring it back to Parse, after a hazardous situation (like data loss on production).
The few solutions we have considered are the following:
1) Use external server for backup
BackUp:
- use the REST API to constantly back up data to a remote MySQL server (we chose MySQL for customized analytics purpose, since it's way faster and easier to handle data with MySQL for us)
ImportBack:
a) - recreate JSON objects from MySQL backup and use the REST API to send back to Parse.
Say we use the batch operation which permits 50 simultaneous objects to be created with 1 query, and assume it takes 1 sec for every query, 1million data sets will take 5.5hours to transfer to Parse.
b) - recreate one JSON file from MySQL backup and use the Dashboard to import data manually.
We just tried with 700,000 records file with this method: it took about 2 hours for the loading indicator to stop and show the number of rows in the left pane, but now it never opens in the right pane (it says "operation time out") and it's over 6hours since the upload started.
So we can't rely on 1.b, and 1.a seems to take too long to recover from a disaster (if we have 10 million records, it'll be like 55 hours = 2.2 days).
Now we are thinking about the following:
2) Constantly replicate data to another app
Create the following in Parse:
- Production App: A
- Replication App: B
So while A is in production, every single query will be duplicated to B (using background job constantly).
The downside is of course that it'll eat up the burst limit of A as it'll simply double the amount of query. So not ideal thinking of scaling up.
What we want is something like AWS RDS which gives an option to automatically backup daily.
I wonder how this could be difficult for Parse since it's based on AWS infra.
Please let me know if you have any idea on this, will be happy to share know-hows.
P.S.:
We’ve noticed an important flaw in the above 2) idea.
If we replicate using REST API, all the objectIds of all Classes will be changed, so every 1to1 or 1toMany relations will be broken.
So we think about putting a uuid for every object class.
Is there any problem about this method?
One thing we want to achieve is
query.include(“ObjectName”)
( or in Obj-C “includeKey”),
but I suppose that won’t be possible if we don’t base our app logic on objectId.
Looking for a work around for this issue;
but will uuid-based management be functional under Parse’s Datastore logic?
Parse has never lost production data. While we don't currently offer automated backups, you can request one any time you like, and we're working on making all of this even nicer. Additionally, it's easier in most cases to import the JSON export file through the data browser rather than using the REST batch.
I can confirm that today, Parse did lost my data. Or at least it appeared to be so.
After several errors where detected on multiple apps (agreed by Parse Status twitter account), we could not retrieve data for an app, without any error.
It was because an entire column of one of our class (type pointer) disappeared and data was not present anymore in the dashboard.
We are using this pointer column to filter / retrieve data, so the returned queries and collections were empty.
So we decided to recreate the column manually. By chance, recreating the column, with the same name and type, solved the issue and the data was still there... I can't explain it but I really thought, and the app reacted as if, data were lost.
So an automated backup and restore option is mandatory, it is not an option.
On December 2015 parse.com released a new dashboard with an improved export feature.
Just select your app, click on "App Settings" -> "General" -> "Export app data". Parse generates a json-file for every class in your app and sends an email to you, if the export-progress is done.
UPDATE:
Sad but true, parse.com is winding down: http://blog.parse.com/announcements/moving-on/
I had the same issue of backing up parse server data. As parse server is using mongodb that is why backing up data is not an issue I have just done a simple thing. downloaded the mongodb backup from the server. And then restored it using
mongorestore /path-to-mongodump (extracted files)
As parse has been turned to open source.Therefore we can adopt this technique.
For accidental deletes, writing a cloud function 'beforedelete' to backup the current row to another class would work.
For regular backups, manual export of changed records (use filter) will be useful. For recovery this requires you to write scripts / use import option (not so sure) in data browser. You could also write a cloud function replicate data on your backup server (haven't tried this yet).
However there are some limitations to cloud code that you should consider before venturing into it:
https://parse.com/docs/cloud_code_guide#functions-resource

Resources