I am looking for truck GPS data. Could you please point me to a free resource I can download?
I need a dataset with at least 10,000 records containing time, latitude, and longitude.
What is the relative performance of the various Power BI data sources?
Specifically, between SharePoint Online, Azure Blob Storage, and Azure Data Lake.
We're looking at pushing some data into one of these for consumption by Power BI.
As these are classed as file sources, you will be limited to importing data, to a 1 GB dataset size, and to a refresh frequency of 8 times a day.
It will depend on the volume and type. If it is CSV files, there is not much between Blob and Data Lake: a base read of 1 GB takes about 5-8 minutes, without any transformations. For multiple files, it will depend on how many there are.
For SharePoint, will it be a list, or documents in a library? In testing, about 30,000 items in a list can take about 20-30 minutes to load, but again it will depend on the structure, for example how wide it is.
If you are pushing data into something and it has a known structure, use an Azure SQL Database instead; then you can use DirectQuery, so the data is always up to date.
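For the push itself, here is a minimal sketch of loading rows into Azure SQL from Python with pyodbc; the server, database, credentials, and the measurements table are all hypothetical placeholders:

    import pyodbc  # requires the Microsoft ODBC Driver for SQL Server

    # Hypothetical connection details; replace with your own server/database/credentials.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver.database.windows.net;"
        "DATABASE=mydb;UID=myuser;PWD=mypassword"
    )
    cursor = conn.cursor()
    cursor.fast_executemany = True  # batches the inserts for much better throughput

    rows = [(1, 42.0), (2, 17.5), (3, 99.9)]  # sample (id, value) pairs
    cursor.executemany("INSERT INTO measurements (id, value) VALUES (?, ?)", rows)
    conn.commit()

With the data in Azure SQL, the Power BI report can use DirectQuery and does not depend on a scheduled refresh.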
I am quite new to "Big Data" technologies, especially Cassandra, so I need your advice for the task I have to do. I have been looking at Datastax examples about handling time series, and at various discussions here on this topic, but if you think I might have missed something, feel free to tell me.
Here is my problem.
I need to store and analyze data coming from about 100 sensor stations that we are testing. In each sensor station, we have several thousand sensors. For each station, we run several tests (about 10, each one lasting about 2h30), during which the sensors record information every millisecond (the values can be boolean, integer, or float). The records of each test are kept on the station during the test, then sent to me once the test is completed. That means about 10 GB for each test (each parameter is about 1 MB of information).
Here is a schema to illustrate the hierarchy:
[Image: hierarchy description]
Right now, I have access to a small Hadoop cluster with Spark and Cassandra for testing. I may be able to install other tools, but I would really prefer to keep working with Spark/Cassandra.
My question is: what could be the best data model for storing then analyzing the information coming from these sensors?
By “analyzing”, I mean:
find the min, max, and average value of a specific parameter recorded by a specific sensor on a specific station; or find those values for a specific parameter across all stations; or find those values for a specific parameter when one or two other parameters of the same station are above a given threshold
plot the evolution of one or more parameters to compare them visually (the same parameter on different stations, or different parameters on the same station)
do some correlation analysis between parameters or stations (e.g. to find whether a sensor is not working).
I was thinking of putting all the information in a single Cassandra table with the following data model:
CREATE TABLE data_stations (
    station text,      // station ID
    test int,          // test ID
    parameter text,    // name of the recorded parameter/sensor
    tps timestamp,     // timestamp of the measurement
    val float,         // measured value
    PRIMARY KEY ((station, test, parameter), tps)
);
However, I don't know if one table would be able to handle all the data: a quick calculation gives about 10^14 rows with the above data model (100 stations x 10 tests x 10,000 parameters x 9,000,000 ms (2h30 in milliseconds) ~= 10^14), even if each partition is "only" 9,000,000 rows.
Other ideas were to split the data across different tables (e.g. one table per station, or one table per test per station, etc.). I don't know what to choose or how, so any advice is welcome!
Thank you very much for your time and help; if you need more information or details, I would be glad to tell you more.
Piar
You are on the right track: Cassandra can handle such data. You can store all the data you want in column families and use Apache Spark over Cassandra to do the required aggregations.
I feel Apache Spark is a good fit for your use case, as it can be used both for aggregations and for calculating correlations.
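As an illustration, here is a minimal PySpark sketch against the table above, assuming the spark-cassandra-connector is on the classpath; the sensors keyspace and the station/parameter names are hypothetical placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Assumes the spark-cassandra-connector package is available, e.g. via
    # spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1
    spark = (SparkSession.builder
             .appName("sensor-aggregations")
             .config("spark.cassandra.connection.host", "127.0.0.1")  # your Cassandra node
             .getOrCreate())

    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="sensors", table="data_stations")  # keyspace name is hypothetical
          .load())

    # Min/max/average of one parameter on one station; the filter maps onto the
    # partition key (station, test, parameter), so only one partition is scanned.
    (df.filter((F.col("station") == "ST01") & (F.col("test") == 1) &
               (F.col("parameter") == "temperature"))
       .agg(F.min("val"), F.max("val"), F.avg("val"))
       .show())

    # Correlation between two parameters of the same station, aligned on timestamp.
    p1 = df.filter((F.col("station") == "ST01") & (F.col("parameter") == "p1")) \
           .select("tps", F.col("val").alias("v1"))
    p2 = df.filter((F.col("station") == "ST01") & (F.col("parameter") == "p2")) \
           .select("tps", F.col("val").alias("v2"))
    print(p1.join(p2, "tps").corr("v1", "v2"))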
You may also check out Apache Hive, as it can query data in HDFS directly (through external tables).
Check these:
Cassandra - Max. size of wide rows?
Limitations of Cassandra
I want to know the difference between Hadoop batch analytics and Hadoop real-time analytics.
E.g., Hadoop real-time analytics can be done using Apache Spark, while Hadoop batch analytics can be done using MapReduce programming.
Also, if real-time analytics is the preferred one, what is batch analytics still required for?
Thanks.
Batch means you process all the data you have collected so far. Real-time means you process data as it enters the system. Neither one is "preferred".
Let me explain use cases for batch processing and real-time processing.
Batch processing:
In a stock market application, you have a requirement to provide the following summary data on a daily basis:
For each stock, the total number of buy orders and the sum of all buy order amounts
For each stock, the total number of sell orders and the sum of all sell order amounts
For each stock, the total number of successful and failed orders
etc.
Here you need 24 hours of stock market data to generate these reports.
Weather application:
Save weather reports of all places in the world, for all countries. For a given place like New York, or a country like America, find the hottest and coldest day since 1900. This query requires huge input data sets, which require processing across thousands of nodes.
You can use a Hadoop MapReduce job (or an equivalent Spark batch job, as sketched below) to produce the above summary. You may have to process petabytes of data stored across 4,000+ servers in a Hadoop cluster.
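Here is a minimal batch sketch in PySpark for the stock summary; the orders.csv input with stock, side, amount, and status columns is hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-order-summary").getOrCreate()

    # Hypothetical input: one day of orders with columns stock, side (BUY/SELL), amount, status.
    orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

    summary = orders.groupBy("stock").agg(
        F.count(F.when(F.col("side") == "BUY", 1)).alias("buy_orders"),
        F.sum(F.when(F.col("side") == "BUY", F.col("amount"))).alias("buy_total"),
        F.count(F.when(F.col("side") == "SELL", 1)).alias("sell_orders"),
        F.sum(F.when(F.col("side") == "SELL", F.col("amount"))).alias("sell_total"),
        F.count(F.when(F.col("status") == "FAILED", 1)).alias("failed_orders"),
    )
    summary.show()

The job runs once, after the full day of data has been collected; that is the defining property of batch processing.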
Real-time analytics:
Another use case: you are logged into a social networking site like Facebook or Twitter. Your friends post messages on your wall or tweet on Twitter, and you have to get these notifications in real time.
When you visit a site like Booking.com to book a hotel, you get real-time notifications like "X users are currently viewing this hotel". These notifications are generated in real time.
In the above use cases, the system should process streams of data and generate real-time notifications to users, instead of waiting for a full day of data. Spark Streaming provides excellent support for handling these kinds of scenarios, as sketched below.
Spark uses in-memory processing for faster query execution, but it is not always possible to keep petabytes of data in memory. Spark can process terabytes of data, while Hadoop can process petabytes.
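For a feel of the streaming side, here is a minimal Spark Structured Streaming sketch; the socket source on localhost:9999 and the "hotel_123 view" event format are stand-ins for whatever your real feed is (Kafka, in most deployments):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("live-hotel-views").getOrCreate()

    # Hypothetical source: one event per line on a local socket, e.g. "hotel_123 view".
    events = (spark.readStream.format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .option("includeTimestamp", True)  # adds an arrival "timestamp" column
              .load())

    # Count viewers per hotel over 1-minute windows, updated as events arrive.
    views = (events
             .withColumn("hotel", F.split("value", " ").getItem(0))
             .groupBy(F.window("timestamp", "1 minute"), "hotel")
             .count())

    query = views.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()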
Hadoop batch analytics and real-time analytics are totally different; which one you need depends on your use case. For example, if you have a large raw dataset and you want to extract only a little information from it (based on some calculation, trending, etc.), this can be done with batch processing, like finding the minimum temperature over the last 50 years.
Real-time analytics, in contrast, means you need the expected output as soon as possible, like getting your friend's tweet as soon as it is posted.
Batch data processing is an efficient way of processing high volumes of data where a group of transactions is collected over a period of time: data is collected, entered, and processed, and then the batch results are produced (Hadoop is focused on batch data processing). Batch processing requires separate programs for input, processing, and output. Payroll and billing systems are examples.
In contrast, real-time data processing involves a continual input, processing, and output of data; data must be processed within a small time period (in or near real time). Radar systems, customer service systems, and bank ATMs are examples.
I have a project that will generate a huge number of images (1,000,000; sorry, I misstated this earlier).
I need to run every image through an algorithm.
Can you advise me on an architecture for this project?
It is a proprietary algorithm in the area of computer vision.
The average image size is about 20 kB.
I need to process the images when they are uploaded, and 1 or 2 more times on request.
On average, I receive a million images a day, each of which needs to go through the algorithm 1-2 times per day.
Yes, most often the images will be stored on a local disk.
When I process an image, I generate a new image.
Current view:
Most likely I will have a few servers (which I do not own), and on each of the servers I have to perform the procedure described above.
Internet bandwidth between the servers is very thin (about 1 Mb/s), but I need to exchange messages between the servers (to update the coefficients of the neural network) and to push algorithm updates.
On current hardware (Intel family 6, model 26), it takes about 10 minutes to complete the full procedure for 50,000 images.
Maybe there will be servers with wide internet channels, so I can upload these images to the servers I have.
I don't know much about images, but I guess this should help: http://www.cloudera.com/content/cloudera/en/why-cloudera/hadoop-and-big-data.html
Also, please let us know what kind of processing you are talking about, and what "a huge number of images" means: how many do you expect per hour or per day?
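To put a number on the per-node throughput: 50,000 images in 10 minutes is roughly 83 images per second, so a single machine running continuously already covers 1,000,000 images a day. Here is a minimal single-machine sketch in Python, where process_image is a hypothetical stand-in for the proprietary algorithm and the directory names are placeholders:

    import os
    from multiprocessing import Pool
    from pathlib import Path

    IN_DIR = Path("incoming")    # hypothetical: where uploaded images land
    OUT_DIR = Path("processed")  # hypothetical: where generated images go

    def process_image(path):
        """Stand-in for the proprietary computer-vision algorithm."""
        data = path.read_bytes()
        result = data  # the real algorithm would transform the image bytes here
        (OUT_DIR / path.name).write_bytes(result)

    if __name__ == "__main__":
        OUT_DIR.mkdir(exist_ok=True)
        images = sorted(IN_DIR.glob("*.jpg"))  # assumes JPEG uploads
        # One worker per core; the work is CPU-bound, so a process pool
        # sidesteps the GIL and keeps all cores busy.
        with Pool(processes=os.cpu_count()) as pool:
            pool.map(process_image, images)

Given the 1 Mb/s links, it makes sense to process each image on the server where it was uploaded and only exchange the small neural-network coefficient updates between machines.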
How can I find all available addresses from the Bing Maps service within about 100 meters of my GPS location? I have already implemented getting my GPS location and address searching.
Put differently: I want to get the 4 nearest addresses to my GPS location within 100 meters.
You need help from a web service which provides nearby-location services.
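For example, the Bing Maps REST Locations API can reverse-geocode a point to its nearest address. Here is a minimal Python sketch (the API key and coordinates are placeholders); to collect several distinct nearby addresses, one workaround is to repeat the call for a few points sampled within your 100 m radius:

    import requests

    BING_MAPS_KEY = "YOUR_KEY"  # placeholder: your Bing Maps API key

    def reverse_geocode(lat, lon):
        """Return the formatted addresses the Locations API finds at a point."""
        url = f"https://dev.virtualearth.net/REST/v1/Locations/{lat},{lon}"
        resp = requests.get(url, params={"key": BING_MAPS_KEY})
        resp.raise_for_status()
        resources = resp.json()["resourceSets"][0]["resources"]
        return [r["address"]["formattedAddress"] for r in resources]

    # Example: reverse-geocode the current GPS fix (placeholder coordinates).
    print(reverse_geocode(47.6062, -122.3321))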