I have lots of input data from Microsoft Teams, how can I merge or collect all of it into a data set with the Amazon Console? - data-collection

I have lots of input data from Microsoft Teams, about recorded meetings. How can I merge or collect all of it into a data set using the Amazon Console?
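One low-tech approach, if the Teams meeting data can be exported as CSV (or similar) files, is to merge them with pandas before loading them anywhere on AWS. A minimal sketch, with made-up column names and in-memory CSV text standing in for the exported files:

```python
import pandas as pd
from io import StringIO

# Hypothetical exports: each Teams meeting record as CSV text. In practice
# these could be files in S3, read with pd.read_csv("s3://bucket/key").
exports = [
    "meeting_id,speaker,minutes\n1,alice,30\n1,bob,15\n",
    "meeting_id,speaker,minutes\n2,alice,10\n",
]

# Concatenate all exports into one data set with a fresh index.
dataset = pd.concat(
    (pd.read_csv(StringIO(chunk)) for chunk in exports), ignore_index=True
)
print(dataset)
```

This only works if every export shares the same columns; otherwise `pd.concat` will introduce NaN columns and the schema needs to be reconciled first.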

Related

Choosing between DynamoDB Streams vs. Kinesis Streams for IoT Sensor data

I have a fleet of 250 Wi-Fi-enabled IoT sensors streaming weight data. Each device samples once per second. I need help choosing between AWS DynamoDB Streams and AWS Kinesis Streams to store and process this data in real time. Here are some additional requirements:
I need to keep all raw data in a SQL-accessible table.
I also need to clean the raw stream data with Python's Pandas library to recognize device-level events based on weight changes (e.g. if the weight on sensor #1 increases, record "sensor #1 increased by x lbs at XX:XX PM"; if there is no change, do nothing).
I need that change-event data (derived from the raw streams with Pandas) to be accessible in a real-time dashboard (e.g. device #1's weight just went to zero, prompting an employee to refill container #1).
Either DDB Streams or Kinesis Streams can support Lambda functions, which is what I'll use for the data cleaning, but I've read the documentation and comparison articles and can't distinguish which is best for my use case. Cost is not a key consideration. Thanks in advance!!
Unfortunately, I think you will need a few pieces of infrastructure for a full solution.
I think you could use Kinesis with Firehose to write the raw data to a database where it can be queried with SQL.
For the data cleaning step, I think you will need a stateful stream processor like Flink or Bytewax; the transformed data can then be written to a real-time database, or back to Kinesis, so that it can be consumed by a dashboard.
DynamoDB Streams works with DynamoDB: it streams row changes to be picked up by downstream services like Lambda. You mentioned that you want the data stored in a SQL database, and DynamoDB is a NoSQL database, so you can exclude that service.
I'm not sure why you want the data in a SQL database. If it is time-series data, you would probably store it in a time-series DB like Amazon Timestream.
If you are using AWS IoT Core to send data over MQTT to AWS, you can forward those messages to a Kinesis Data Stream (or SQS). You can then have a Lambda triggered on messages received in Kinesis; that Lambda can process the data and store it in the database you want.
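Whichever stream carries the raw samples, the Pandas cleaning step the question describes can be sketched independently of the transport. A minimal sketch, with made-up column names (`device_id`, `ts`, `weight`) and an invented `detect_weight_events` helper, that turns per-second samples into change events:

```python
import pandas as pd

def detect_weight_events(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Turn raw per-second samples into per-device weight-change events."""
    df = df.sort_values(["device_id", "ts"])
    # Per-device difference between consecutive samples.
    df["delta"] = df.groupby("device_id")["weight"].diff()
    # Keep only rows where the weight actually changed (NaN first rows drop out).
    events = df[df["delta"].abs() > threshold].copy()
    events["message"] = events.apply(
        lambda r: f"sensor #{r.device_id} changed by {r.delta:+.1f} lbs at {r.ts}",
        axis=1,
    )
    return events[["device_id", "ts", "delta", "message"]]

samples = pd.DataFrame({
    "device_id": [1, 1, 1, 2, 2],
    "ts": ["12:00", "12:01", "12:02", "12:00", "12:01"],
    "weight": [10.0, 12.5, 12.5, 5.0, 0.0],
})
print(detect_weight_events(samples))
```

Inside a Lambda you would build the DataFrame from the batch of stream records instead of a literal, but the event-detection logic is the same.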

How to use AWS Kinesis streams for multiple different data sources

We have a traditional batch application where we ingest data from multiple sources (Oracle, Salesforce, FTP Files, Web Logs etc.). We store the incoming data in S3 bucket and run Spark on EMR to process data and load on S3 and Redshift.
Now we are thinking of making this application near real time by bringing in AWS Kinesis and then using Spark Structured Streaming on EMR to process the streaming data and load it to S3 and Redshift. Given that we have a wide variety of data, e.g. 100+ tables from Oracle, 100+ Salesforce objects, 20+ files coming from an FTP location, web logs, etc., what is the best way to use AWS Kinesis here?
1) Use a separate stream for each source (Salesforce, Oracle, FTP) and a separate shard (within a stream) for each table/object - each consumer reads from its own shard, which carries a particular table/file.
2) Use a separate stream for each table/object - we would end up with 500+ streams in this scenario.
3) Use a single stream for everything - not sure how the consumer app would read data in this scenario.
Kinesis does not care what data you put into a stream; data is just a blob to Kinesis. It is up to you to write (code) the producers and consumers for a stream. You could intermix different types of data in one stream; the consumer will then need to figure out what each blob is and what to do with it.
I would break this into multiple streams based upon data type and priority of the data. This will make implementation and debugging a lot easier.
I think you are misunderstanding what shards are. They are for performance and not for data separation.
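The "single stream, consumer dispatches by type" option can be sketched without any AWS calls: the producer tags each blob with its source before calling `put_record`, and the consumer routes on that tag. `encode_record`, `dispatch`, and the handler names below are all made up for illustration:

```python
import json

def encode_record(source: str, payload: dict) -> bytes:
    """Producer side: wrap each payload with a 'source' tag. This is the
    blob you would pass as Data= to kinesis.put_record(...)."""
    return json.dumps({"source": source, "payload": payload}).encode()

# Consumer side: one handler per data type (stubs for illustration).
HANDLERS = {
    "oracle": lambda p: ("oracle-table", p),
    "salesforce": lambda p: ("sf-object", p),
}

def dispatch(blob: bytes):
    """Decode a blob from the stream and route it by its source tag."""
    record = json.loads(blob)
    handler = HANDLERS.get(record["source"])
    if handler is None:
        raise ValueError(f"unknown source: {record['source']}")
    return handler(record["payload"])

blob = encode_record("oracle", {"table": "ORDERS", "row": {"id": 1}})
print(dispatch(blob))
```

The same envelope works whether you end up with one stream or a few per-priority streams; only the routing table changes.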

Cassandra for datawarehouse

Is Cassandra a good alternative to Hadoop as a data warehouse, where data is append-only and updates in the source databases should not overwrite existing rows in the warehouse but get appended? Is Cassandra really meant to act as a data warehouse, or just as a database to store the results of batch/stream queries?
Cassandra can be used both as a data warehouse (raw data storage) and as a database (for final data storage). It depends more on what you want to do with the data.
You may even need both Hadoop and Cassandra for different purposes.
Assume you need to gather and process data from multiple mobile devices and provide a complex aggregation report to the user.
At first, you need to save data as fast as possible (as new portions arrive very often), so you use Cassandra here. As Cassandra is limited in aggregation features, you load the data into HDFS and do some processing via HQL scripts (assuming you're not very good at coding but great at complicated SQL). Then you move the report results from HDFS into a dedicated Cassandra reports table partitioned by user id.
So when the user wants an aggregation report about his activity over the last month, the application takes the id of the active user and returns the aggregated result from Cassandra (a simple key-value lookup).
So for your question, yes, it could be an alternative, but the selection strategy depends on the data types and your application business cases.
You can read more information about usage of Cassandra here.
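The access pattern described above (reports partitioned by user id, fetched as a simple key-value lookup) can be illustrated with a toy in-memory stand-in for the Cassandra table; the CQL in the comment is a rough equivalent, not a tested schema:

```python
# Toy stand-in for the Cassandra reports table described above:
# partition key = user_id, clustering key = month, so fetching one
# user's monthly report is a single-partition key-value lookup.
reports = {}  # {user_id: {month: aggregated_report}}

def store_report(user_id: int, month: str, report: dict) -> None:
    """Batch job writes the pre-aggregated report into the user's partition."""
    reports.setdefault(user_id, {})[month] = report

def monthly_report(user_id: int, month: str) -> dict:
    # Roughly equivalent CQL:
    #   SELECT * FROM reports WHERE user_id = ? AND month = ?;
    return reports.get(user_id, {}).get(month, {})

store_report(42, "2024-01", {"events": 120, "avg_per_day": 3.9})
print(monthly_report(42, "2024-01"))
```

The point of the design is that the expensive aggregation happens offline (in HDFS), and Cassandra only ever answers cheap single-partition reads.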

Map Reduce - How to plan the data files

I would like to use AWS EMR to query large log files that I will write to S3. I can design the files any way I like. The data is created at a rate of 10K entries/minute.
The logs consist of dozens of data points, and I'd like to collect data over a very long period of time (years) to compare trends, etc.
What are the best practices for creating such files that will be stored on S3 and queried by AWS EMR cluster?
What's the optimal file size? Should I create separate files, for example on an hourly basis?
What is the best way to name the files?
Should I place them in daily/hourly buckets, or all in the same bucket?
What's the best way to handle things like adding some data after a while, or a change in the data structure I use?
Should I compress things, for example by leaving domain names out of URLs, or keep as much data as possible?
Is there any concept like partitioning? (The data covers hundreds of websites, so I could use site ids, for example.) I must be able to query all the data together, or by partition.
Thanks!
In my opinion you should use an hourly bucket structure to store the data in S3, and then use a pipeline to schedule your MR job to clean the data.
Once you have cleaned the data, you can keep it in a location in S3 and then run a data pipeline on an hourly basis, with a lag of one hour with respect to your MR pipeline, to load the processed data into Redshift.
Hence at 3 AM on a given day you will have 3 hours of processed data in S3 and 2 hours loaded into the Redshift DB.
To do this you can have one machine dedicated to running pipelines, and on that machine you can define a shell/Perl/Python script to load data into your DB.
You can use the AWS bucketing formatter for year, month, date, hour and so on, e.g.:
yyyy=#{format(minusHours(#scheduledStartTime,2),'YYYY')}/mm=#{format(minusHours(#scheduledStartTime,2),'MM')}/dd=#{format(minusHours(#scheduledStartTime,2),'dd')}/hh=#{format(minusHours(#scheduledStartTime,2),'HH')}/*
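The same lagged hourly partition prefix can be computed in plain Python if you drive the load step from your own scripts instead of the Data Pipeline formatter; `partition_prefix` is an invented helper:

```python
from datetime import datetime, timedelta

def partition_prefix(scheduled_start: datetime, lag_hours: int = 2) -> str:
    """Build a Hive-style yyyy=/mm=/dd=/hh= S3 key prefix, lagged by
    lag_hours like the minusHours(...) expression above."""
    t = scheduled_start - timedelta(hours=lag_hours)
    return f"yyyy={t:%Y}/mm={t:%m}/dd={t:%d}/hh={t:%H}/"

print(partition_prefix(datetime(2024, 1, 5, 3)))
# → yyyy=2024/mm=01/dd=05/hh=01/
```

Using `key=value` path segments like this lets Hive/EMR treat each segment as a partition column, so queries scoped to a day or an hour only scan the matching prefixes.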

How to do complex query on big data?

Hi everyone.
I have about 6 GB of data in HDFS that has been exported from MySQL, and I have written MapReduce jobs to pre-process the data and fill in some key fields so that the data can be queried easily.
As the business demands different aggregations grouped by day, hour, hospital, area, etc.,
I have to write many Hive SQL queries that export data to local disk, and then Python scripts to parse the files on local disk to get the data in demand.
Is there a good technique on Hadoop to solve this? I am still considering options.
Can you help me, please?
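The per-dimension aggregations described (group by day, hour, hospital, area) can often be computed in one place instead of exporting files and re-parsing them. As a local illustration with toy data, here is the same kind of group-by in pandas; in Hive the equivalents would be separate GROUP BY queries, or one query with GROUPING SETS:

```python
import pandas as pd

# Toy stand-in for the pre-processed HDFS data described above.
rows = pd.DataFrame({
    "day": ["2024-01-01"] * 4,
    "hour": [9, 9, 10, 10],
    "hospital": ["A", "B", "A", "B"],
    "visits": [3, 5, 2, 7],
})

# One aggregation per business demand, computed directly instead of
# exporting files and parsing them with a separate script.
by_hour = rows.groupby(["day", "hour"])["visits"].sum()
by_hospital = rows.groupby("hospital")["visits"].sum()
print(by_hour)
print(by_hospital)
```

At 6 GB the data may even fit on one machine, in which case pulling it once and aggregating in pandas (or running Spark SQL on the cluster) avoids the Hive-export/local-parse round trip entirely.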
