big data - where does the data come from? [closed] - hadoop

This might seem like an inane question, but with all the buzz about big data I was curious how the typical datasets used in big data are sourced. Twitter keywords seem to be a common source, but what are the origins of the huge Twitter feed files that get analysed? I saw an example analysing election-related words like Obama and Romney. Has someone queried the Twitter API and downloaded, effectively, several terabytes of tweets? Does Twitter even want people hitting their servers that hard? Or is this data already 'owned' by the companies doing the analytics? It might sound an odd scenario, but most of the articles I have seen are fuzzy about these basic physical steps. Any links to good articles or tutorials that address these fundamental issues would be much appreciated.

Here are some ideas for sourcing Big Data:
As you pointed out, Twitter is a great place to grab data, and there's a lot of useful analysis to do. If you're taking the online Data Science course, one of the assignments is actually to collect live data from Twitter for analysis, so I would recommend you take a look at that assignment; the process of getting live Twitter data is described there in detail. You could let the live stream run for days, and it would generate gigabytes of data the longer it runs.
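For a concrete starting point, here is a minimal sketch of consuming a filtered live stream, assuming the legacy Twitter v1.1 statuses/filter endpoint (since retired) and the requests and requests_oauthlib libraries; the credentials and tracked keywords are placeholders:

```python
import requests
from requests_oauthlib import OAuth1

# Placeholder credentials from a Twitter developer account.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# Track election-related keywords on the (legacy) streaming endpoint.
url = "https://stream.twitter.com/1.1/statuses/filter.json"
with requests.post(url, auth=auth, stream=True,
                   data={"track": "obama,romney"}, timeout=90) as resp:
    with open("tweets.jsonl", "a") as out:
        for line in resp.iter_lines():
            if line:  # skip keep-alive newlines
                out.write(line.decode("utf-8") + "\n")
```

Left running for days, a filtered stream like this appends one JSON tweet per line and the file grows into the gigabytes.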
If you have a website, you could collect your web server logs. It might not be a lot of data if it's a small website, but for large websites that see a lot of traffic this is a huge source of data. Think about what you could do if you had StackOverflow's web server logs...
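To give a flavor of mining such logs, here is a minimal sketch that tallies the most requested URLs from an access log in the Apache/nginx "combined" format; the file name and the regex are assumptions:

```python
import re
from collections import Counter

# Rough pattern for the Apache/nginx "combined" log format.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

hits = Counter()
with open("access.log") as f:  # assumed log file location
    for line in f:
        m = LINE_RE.match(line)
        if m and m.group("status") == "200":
            hits[m.group("path")] += 1

# Print the ten most requested pages.
for path, count in hits.most_common(10):
    print(count, path)
```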
Oceanographic data, which you can find at Marinexplore; they have some huge datasets available for download if you want to analyze ocean data.
Web crawl data, of the kind used by search engines. You can find open crawl data at Common Crawl, which is already hosted on Amazon S3, so it's ready for your Hadoop jobs to run on it! You could also get data dumps from Wikipedia.
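To make that concrete, here is a minimal sketch of pulling one Common Crawl archive from S3 and iterating its records, assuming the boto3 and warcio libraries; the object key is hypothetical, since real paths change with each crawl:

```python
import boto3
from warcio.archiveiterator import ArchiveIterator

s3 = boto3.client("s3")  # the commoncrawl bucket is publicly readable
# Hypothetical key; list the bucket for real crawl paths.
key = "crawl-data/CC-MAIN-2013-20/segments/.../warc/part-00000.warc.gz"
s3.download_file("commoncrawl", key, "part-00000.warc.gz")

# Walk the WARC records and print the URL of each crawled response.
with open("part-00000.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
```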
Genomic data is now available on a very large scale; you can find genome data from the 1000 Genomes Project via FTP.
...
More generally, I would advise you to look at the Amazon AWS public datasets, which include a bunch of big datasets on various topics, if you're not just looking at Twitter but at Big Data in a more general context.

Most businesses get their social data from Twitter Certified data partners such as Gnip.
Note: I work for Gnip.


What is the difference between Elasticsearch, Apache Druid, and Rockset? [closed]

I'm working on a game application where I need real-time data for the leaderboards I'm building. I've read a bunch of Stack Overflow posts and company blogs, but honestly I'm not sure which one best fits my use case. I am using DynamoDB to record players' recent moves, and the history of moves is in Kafka. I am looking to stream data from these two sources into a database that my leaderboard service can then query to render the contents of each leaderboard. My data velocity is modest (1K game events/sec). I found these three different databases that I could use; has anybody used any of them for game leaderboarding? If so, can you share the advantages or pains you encountered while doing so? According to all three companies, they can do real-time data.
You would have to evaluate the scale and performance that you require, and it is difficult for me to estimate those based on the data you provided, but I can do a feature comparison of these systems.
The first option is to run your leaderboards by querying DynamoDB itself, so you do not need any additional systems. The obvious advantage is that there is one less component to manage. But I am assuming that your leaderboards need complex logic to render, and because the DynamoDB API deals with keys and values, you would have to fetch a lot of data from DynamoDB for every query that renders the leaderboard.
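For illustration, here is a minimal sketch of pulling one player's recent moves with boto3, assuming a hypothetical moves table with a player_id partition key and a ts sort key; ranking across all players would then happen in application code, which is exactly the pain point:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("moves")  # hypothetical table

# Fetch one player's most recent moves. Ranking *all* players would
# require one such query per player (or a full scan) plus aggregation
# in application code.
resp = table.query(
    KeyConditionExpression=Key("player_id").eq("player-42"),
    ScanIndexForward=False,  # newest first by the ts sort key
    Limit=100,
)
moves = resp["Items"]
```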
The second option you specified is Elasticsearch: a great system that returns query results really fast because it stores data as an inverted index. However, you won't be able to do JOINs between your DynamoDB data and the Kafka stream. But you can certainly run a lot of concurrent queries on Elastic, and I am assuming you need concurrent queries because you are powering an online game where multiple players access the leaderboard at the same time.
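As a sketch of what a leaderboard query could look like there, here is a top-10 terms aggregation over a hypothetical scores index, using the elasticsearch Python client against an assumed local node:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

# Top 10 players by total score over a hypothetical "scores" index.
resp = es.search(index="scores", size=0, aggs={
    "top_players": {
        "terms": {"field": "player_id", "size": 10,
                  "order": {"total_score": "desc"}},
        "aggs": {"total_score": {"sum": {"field": "score"}}},
    },
})
for bucket in resp["aggregations"]["top_players"]["buckets"]:
    print(bucket["key"], bucket["total_score"]["value"])
```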
The third option, Druid, is a hybrid between a data lake and a data warehouse. You can store large volumes of semi-structured data, but unlike Elastic, you need to flatten nested JSON data at ingest time. I have used Druid for large-scale analytical processing to power my dashboards, and it does not support as high a concurrency as Elastic.
Rockset seems to be a much newer product and is a hosted service in the cloud. It says that it builds an inverted index like Elastic and also supports JOINs. It can auto-tail data from DynamoDB (using change streams) and Kafka. I do not see any performance numbers on the website, but the functionality is very compatible with what I would need for building a game leaderboard.

Storing media in Enterprise Application [closed]

Currently we use Oracle for storing images in the application, but we expect to see a lot of images/videos in the application. We would like to move away from Oracle to be able to shard easily and achieve high throughput. Any recommendations?
Has anyone tried using NoSQL databases such as Couchbase/MongoDB for this purpose? Are they optimized for it?
I see that Cloudinary uses Amazon S3 for this purpose, but I am looking for something that can be deployed in our own datacenter because of privacy concerns.
From your problem description, I can't see any indication for or against a NoSQL database.
Having media like pictures, sound, or video in a database means just having a large uninterpreted binary object. Uninterpreted means the database can store and deliver the binary, but can't analyze it for its properties, take it as a basis for queries, and the like (which is what databases are made for).
Both relational and non-relational databases provide data types for that kind of BLOB. The features in which they differ are, for example,
tabular vs. tree-structured data structures - not applicable to the BLOB, as it will be one attribute no matter how large it becomes,
different sorts of transaction logic (CAP theorem) that aren't addressed by the BLOB subject matter.
So I'm afraid your architecture will need to be decided on a much broader basis than just your media data. What are your data structures? What are your query and update scenarios?
What I see people do with Couchbase is store all of the metadata about the image in a JSON document in Couchbase, but host the image itself in something optimized for files. You get the benefits of both worlds. For the kind of use case you mention, in my experience a NoSQL database will be much better than a relational database.
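As a rough sketch of that split, assuming the Couchbase Python SDK (4.x style) and made-up cluster credentials, bucket, key, and storage path; the document holds the metadata while the binary lives elsewhere:

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Assumed local cluster and credentials.
cluster = Cluster("couchbase://localhost",
                  ClusterOptions(PasswordAuthenticator("user", "pass")))
collection = cluster.bucket("media").default_collection()

# Metadata lives in Couchbase; the binary lives in a file/object store.
collection.upsert("image::42", {
    "filename": "photo.jpg",
    "content_type": "image/jpeg",
    "size_bytes": 1843200,
    "storage_path": "/mnt/media/ab/cd/photo.jpg",  # hypothetical path
})
```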
Having managed very large relational and NoSQL databases with blobs in them, IMO it is a terrible idea in most cases, regardless of the database type. So I wrote up this blog post for just such a situation.
As you are looking for a private deployment in your own data center, you may consider MongoDB or OpenStack Swift.
I have seen people use MongoDB GridFS (https://docs.mongodb.com/manual/core/gridfs/) for storing images/videos.
The advantages of using MongoDB gridfs:
You can use a MongoDB replica set for fault tolerance/high availability.
You can access a portion of a large file without loading the whole file into memory. Because MongoDB stores files in small chunks (255 KB), video files can be streamed faster (see the sketch after this list).
You can scale using MongoDB sharding.
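Here is a minimal sketch with PyMongo's gridfs module, assuming a local MongoDB instance; the database and file names are made up:

```python
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["media"]  # assumed instance
fs = gridfs.GridFS(db)

# Store a video; GridFS splits it into 255 KB chunks automatically.
with open("clip.mp4", "rb") as f:
    file_id = fs.put(f, filename="clip.mp4")

# Stream it back chunk by chunk, without loading the whole file.
grid_out = fs.get(file_id)
while True:
    chunk = grid_out.read(255 * 1024)
    if not chunk:
        break
    # ... write the chunk to an HTTP response, player buffer, etc.
```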
OpenStack Swift is a highly available, distributed, eventually consistent object/blob store comparable to Amazon S3, which you can deploy in your own data center.
OpenStack Swift is also used by many companies; Rackspace's Cloud Files runs on Swift. You can take a look at Swift here:
http://docs.openstack.org/developer/swift/
S3 has a very strong commitment to privacy. What are your concerns regarding S3? Also, which datacenter are you planning to use once you move away from Oracle's storage?

High Performance DB for Fast Read and Fast Write. No Update or Delete [closed]

I am looking for a database/mechanism where I can write data and read it back with high performance.
This storage will be used for logging important information across multiple systems. Since the logged data is critical, read performance should be very fast, as the data will be used to show history. We never update, delete, or join this data, so I am looking for the right solution for that profile. We might archive the data after a long time, but that's fine to deal with.
I tried looking at different sources to understand the different NoSQL databases, but experts' opinions are always better :)
Must Have:
1. Fast Read without fail
2. Fast Write without fail
3. Random access Performance
4. Replication kind of feature: if one node goes down, another should immediately be up and working
5. Concurrent write/read data
Good to Have:
1. Search content, like analysing the data for auditing, with or without indexes
Not required:
1. Transactions are not required at all
2. Update never happens
3. Delete never happens
4. Joins are not required
Referred: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Disclosure: Kevin Porter has been a Senior Software Engineer at Aerospike, Inc. since May 2013. (ref)
Be sure to consider Aerospike; Aerospike dominates in the adtech space, where high-throughput reads and writes are a requirement. Aerospike is frequently touted as having "the speed of Redis with the scalability of Cassandra." For searching/querying, see Aerospike's secondary index documentation.
For more information see the discussion/articles below:
Aerospike vs Cassandra
Aerospike vs Redis and Mongo
Aerospike Benchmarks
Lastly, verify the performance for yourself with the "One million TPS on EC2" instructions.
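For a taste of the programming model, here is a minimal key-value sketch with the aerospike Python client, assuming a local single-node install; the namespace, set, key, and bin names are made up:

```python
import aerospike

# Connect to an assumed local node.
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# Hypothetical (namespace, set, key) triple and record bins.
key = ("test", "events", "event:12345")
client.put(key, {"level": "ERROR", "msg": "disk full", "ts": 1457000000})

# Reads go straight to the node that owns the key's partition.
(_, meta, bins) = client.get(key)
print(bins["msg"], meta["gen"])
```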
Let me be the Cassandra sponsor.
Disclaimer: I'm not saying Cassandra is better than the others, because I don't know Mongo/Redis/whatever deeply enough, and I don't even want to get into that kind of debate.
The reason I suggest Cassandra is that your needs match perfectly with what Cassandra offers, and your "not required" list is a set of features that are either not supported in Cassandra (joins, for instance) or considered an anti-pattern (deletes, and in some situations updates).
From your "Must Have" list, point by point
Fast read without fail: supported. You can choose the consistency level of each read operation, deciding how important it is to retrieve the freshest information versus how important speed is.
Fast Write without fail: Same as point 1
Random access performance: when coming into the Cassandra world, you have to consider many parameters to get good random access performance, but the most important one that comes to mind is the data model. If you create a data model that scales horizontally (have a look here) and you avoid hotspots, you get what you need. If you model your DB well, you should get O(1) for each operation, since the data are structured to be queried (see the sketch after this list).
Replication: here Cassandra is even better than you might think. If one node goes down, nothing changes for the cluster and everything(*) keeps working perfectly. Cassandra has no single point of failure. I can tell you that with an older Cassandra version I had an uptime of more than 3 years.
Concurrent write/read data: Cassandra uses the LWW (last-write-wins) policy to handle concurrent writes on the same key. The system supports multiple readers and writers, and with newer protocols also async operations.
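As a sketch of the data-model point above, here is a hypothetical log table created through the cassandra-driver Python client: the composite partition key spreads writes across the cluster (avoiding hotspots), and the clustering column keeps each partition ordered for fast reads. The keyspace, table, and column names are made up:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("logs_ks")  # assumed keyspace

# (source, day) as the partition key spreads writes over many
# partitions; the timeuuid clustering column keeps each partition
# sorted so "recent history" reads are a single ordered slice.
session.execute("""
    CREATE TABLE IF NOT EXISTS log_events (
        source  text,
        day     text,
        ts      timeuuid,
        message text,
        PRIMARY KEY ((source, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

session.execute(
    "INSERT INTO log_events (source, day, ts, message) "
    "VALUES (%s, %s, now(), %s)",
    ("billing", "2016-03-01", "payment processed"),
)
```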
There are lots of other interesting features Cassandra offers: linear horizontal scaling is the one I appreciate most, but there is also the fact that you can know the instant at which every piece of data was updated (the LWW timestamp), counter features, and so on.
(*) - provided you don't use consistency level ALL, which, imho, should NEVER be used in such a system.
Here are a few more links on how you can span in-memory and disk storage (DRAM, SSDs, and rotational disks) with Aerospike:
http://www.aerospike.com/hybrid-memory/
http://www.aerospike.com/docs/architecture/storage.html
I think everyone is right in terms of matching the specific DB to your specific use case. For instance, Aerospike is optimal for key-value data. Other options might be better.
By way of analogy, I'll always remember how, decades ago, a sister of mine once borrowed my computer and wrote her term paper in Microsoft Excel. Line after line was a different row of a spreadsheet. It looked ugly as heck, but, uh, okay. She got the task done. She cursed and swore at how difficult it was to edit the thing. No kidding!
Choosing the right NoSQL database for the right task will either make your job a breeze, or could cause you to curse a blue streak if you decided on the wrong basic tool for the task at hand.
Of course, every vendor's going to defend their product. I think it's best the community answer the question. Here's another Stack Overflow thread answering a similar question:
Has anyone worked with Aerospike? How does it compare to MongoDB?
btw: Do you have any more specific insights for us on what type of problem you are trying to solve?

How the "you might like these products" in webstores is implemented?

Some e-commerce platforms have a suggestion feature where, once you have an item in the basket, they tell you that "you might like this product as well". Some, like Amazon, rely on preexisting data about customer behaviour, and their feature is called "Customers Who Bought This Item Also Bought", but some seem to make suggestions by other means.
What are these "other means"? What kind of algorithms do they use in webstores for this capability?
They use data mining, and the particular algorithm you're asking about is called the "nearest neighbor" algorithm.
Here's a link to an article I wrote on the algorithm (as well as many others).
http://www.ibm.com/developerworks/opensource/library/os-weka3/index.html
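As a toy illustration of the idea, here is a minimal item-based nearest-neighbor sketch in Python with NumPy: the products whose purchase columns have the highest cosine similarity to the item in the basket are the ones suggested. The purchase matrix is made-up data:

```python
import numpy as np

# Toy purchase matrix: rows = customers, columns = products;
# 1 means the customer bought the product.
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Cosine similarity between product columns.
norms = np.linalg.norm(purchases, axis=0)
sim = (purchases.T @ purchases) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)  # a product is not its own neighbor

# "Customers who bought product 0 also bought": its nearest neighbors.
basket_item = 0
print(np.argsort(sim[basket_item])[::-1])  # most similar products first
```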
The process is called Business Intelligence: the data is stored in a data warehouse, and the business intelligence process can be run using a product such as SSAS. The process involves grouping the volumes of data (who bought what, and when) into data cubes. Analysis is performed on these cubes and used to compare your purchases with those of others who bought the same product; it then recommends their purchases ("Other customers who bought this also bought item X"). Various other AI algorithms are used to compare patterns across customer trends, such as how they shop, where they click, etc. All this data is accumulated and then added to the data cube for analysis.
The data mining algorithms are outlined below; you could look at the decision tree modelling algorithm, which is how BI determines trends and patterns (in this case, recommendations):
http://msdn.microsoft.com/en-us/library/ms175595.aspx
The majority of suggestions on e-commerce pages are generated by some sort of recommender system (http://en.wikipedia.org/wiki/Recommender_system). There are tools like Mahout (http://mahout.apache.org/) which already implement the most common approaches.
The best book about this kind of algorithm is Programming Collective Intelligence.
As some of the earlier folks answered, this is called a recommendation engine; it is also referred to as collaborative filtering. There are a few tools which do this, and Mahout is one of them. Refer to the blog post I have written, which covers a use case where we used Mahout and Hadoop to build a recommendation engine. As a precursor to that, I have also written up a component architecture showing how these pieces fit together for a data mining problem.
Mahout works in standalone mode and also with Hadoop. The decision to use one or the other really boils down to the size of the historical data that needs to be mined: if the data size is on the order of terabytes and petabytes, you typically use Mahout with Hadoop. Weka is another similar open source project. All of these fall under a category called machine learning frameworks. I hope it helps.

How search engine, say Google's page ranking algorithm work across distributed/multiple machines?

I am new to distributed computing but was wondering how the page ranking algorithm works across multiple machines. For example:
When do they decide whether data should be replicated (if it's needed at all)?
If data is not copied, do they ask servers in other places to give them the result?
Or do they send "modules" (say, parts of a HUGE linked graph) to different servers, one module to one server and another module to another, and then combine the results they receive?
When I search for something, how does it fetch pages from my country (you know, "search pages from <insert country> only")?
This is not homework. Just a question I had. I welcome all ideas, even if they are very general or very detailed or do not answer all of my questions.
Right now I know next to nothing; my hope is to know something after going through the answers.
It rests on three pillars: MapReduce, the Google File System, and BigTable.
Here are some whitepapers on the architecture:
GoogleCluster
MapReduce, GFS, BigTable
Note: some of these are quite outdated; nowadays Google does live index updates, which wouldn't work with MapReduce.
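As a toy illustration of the batch approach from those papers, here is a minimal sketch of one PageRank iteration written in map/reduce style; the graph, damping factor, and iteration count are made-up inputs, and a real deployment shards both phases across many machines:

```python
from collections import defaultdict

DAMPING = 0.85

def map_phase(graph, ranks):
    """Each page emits an equal share of its rank to every outlink."""
    for node, outlinks in graph.items():
        for dest in outlinks:
            yield dest, ranks[node] / len(outlinks)

def reduce_phase(contributions, all_nodes):
    """Sum the shares per page and apply the damping factor."""
    totals = defaultdict(float)
    for node, share in contributions:
        totals[node] += share
    n = len(all_nodes)
    return {node: (1 - DAMPING) / n + DAMPING * totals[node]
            for node in all_nodes}

# Made-up 3-page graph: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {n: 1 / 3 for n in graph}
for _ in range(20):  # iterate until the ranks converge
    ranks = reduce_phase(map_phase(graph, ranks), list(graph))
print(ranks)
```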
