A large and open source data set for research - hadoop

Please help me find a massive data set for a data mining research project.
It would be very helpful if you could suggest search engine data (Google/Yahoo user search history), Wikipedia page view statistics, or a data set of Twitter users' tweets.
I am working with the Hadoop framework and databases, so I want millions of records in each table.

Here is the Million Song Dataset:
http://labrosa.ee.columbia.edu/millionsong/
If you want to extract tweets, I would suggest Twitter's Streaming API.
https://dev.twitter.com/streaming/overview

Is there any difference in metrics when Querying the data using Eloqua API vs Getting a report from Eloqua Insights?

I am validating the data from Eloqua Insights against the data I pulled using the Eloqua API. There are some differences in the metrics. Are there any known issues when pulling the data using the API versus exporting a .csv file from Eloqua Insights?
Absolutely. Besides undocumented data discrepancies that might exist, Insights can aggregate, calculate, and expose various hidden relations between data in Eloqua that are not accessible through an API export definition.
Think of the API as the raw data, with the ability to pick and choose fields and apply a general filter on them, and of Insights/OBIEE as a way to run calculations on that data, create relationships across tables of raw data, and then present the result in a consumable manner to the end user. A user has little use for a 1 gigabyte csv of individual unsubscribes for the past year, but present that in several graphs on a dashboard with running totals, averages, and timeseries, and it suddenly becomes actionable.

Storing and processing timeseries with Hadoop

I would like to store a large number of timeseries from devices. These timeseries have to be validated, can be modified by an operator, and have to be exported to other systems. Holes in the timeseries must be found. Timeseries must be shown in the UI, filtered by serial number and date range.
We have thought about using Hadoop, HBase, OpenTSDB and Spark for this scenario.
What do you think about it? Can Spark connect to OpenTSDB easily?
Thanks
OpenTSDB is really great for storing large amounts of time series data. Internally, it is underpinned by HBase, which means that it had to find a way around HBase's limitations in order to perform well. As a result, the representation of time series is highly optimized and not easy to decode. AFAIK, there is no out-of-the-box connector that would allow you to fetch data from OpenTSDB into Spark.
The following GitHub project might provide you with some guidance:
Achak1987's connector
If you are looking for libs that would help you with time series, have a look at spark-ts - it contains useful functions for missing data imputation as well.
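The "holes must be found" requirement from the question can be illustrated without any particular library. The following is only a minimal pure-Python sketch: it assumes readings are expected at a fixed, known interval, and the function name `find_gaps` and the sample timestamps are made up for illustration (they are not part of OpenTSDB, Spark or spark-ts).

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval):
    """Return (start, end) pairs where two consecutive readings are
    further apart than expected_interval (assumed fixed sampling rate)."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval:
            gaps.append((previous, current))
    return gaps


# Hypothetical readings for one device, nominally one per minute.
readings = [
    datetime(2016, 1, 1, 0, 0),
    datetime(2016, 1, 1, 0, 1),
    datetime(2016, 1, 1, 0, 5),  # 00:02 to 00:04 are missing
    datetime(2016, 1, 1, 0, 6),
]

gaps = find_gaps(readings, timedelta(minutes=1))
for start, end in gaps:
    print(f"hole between {start} and {end}")
```

On a cluster the same comparison of neighbouring timestamps would have to be done per serial number; how best to do that depends on how the data ends up partitioned.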
Warp 10 offers the WarpScript language which can be used from Spark/Pig/Flink to manipulate time series and access data stored in Warp 10 via a Warp10InputFormat.
Warp 10 is Open Source and available at www.warp10.io
Disclaimer: I'm CTO of Cityzen Data, maker of Warp 10.
Take a look at Axibase Time Series Database which has a rather unique versioning feature to maintain a history of value changes for the same timestamp. Once enabled with per-metric granularity, the database keeps track of source, status and times of value modifications for audit trail or data reconciliation.
We have customers streaming data from Spark apps using the Network API, typically once data is enriched with additional metadata (aka series tags) for downstream reporting.
You can query data from ATSD with the REST API or SQL.
Disclaimer: I work for Axibase.

UI for querying large sets of non-cloud financial data

We have a large amount of financial data, stored locally, and cloud is not an option for us right now. We need to give a non-technical user a way to run a few standard queries, with the results of those being stored in a file.
We can definitely write something in-house: a web page through which the user enters the query and corresponding parameters, which creates a job that queries the data, writes it to a file, and lets the user know when it's done.
However, I feel there might be something that already performs similar tasks. Is there a package/tech out there that provides a UI for querying large sets of financial data and dumps the results into a file?
Where is your data stored? In a database or files? What kind of queries do you want to run? SQL?
From your description of "large data" and Web UI, my immediate thoughts are Hadoop HDFS for data storage and Hive for queries and maybe Hue for a web frontend.
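For comparison, the in-house option described in the question (a named query plus parameters, results written to a file) could look roughly like the sketch below. It uses Python's built-in `sqlite3` and `csv` modules as stand-ins for whatever store actually holds the data; the `trades` table, the `STANDARD_QUERIES` catalogue and the function `run_standard_query` are all hypothetical names chosen for illustration.

```python
import csv
import os
import sqlite3
import tempfile

# Hypothetical catalogue of vetted queries a non-technical user may pick from.
STANDARD_QUERIES = {
    "trades_by_symbol": (
        "SELECT trade_date, symbol, price FROM trades "
        "WHERE symbol = ? ORDER BY trade_date"
    ),
}


def run_standard_query(conn, name, params, out_path):
    """Run a named, parameterised query and write the result to a CSV file.
    Returns the number of data rows written."""
    cursor = conn.execute(STANDARD_QUERIES[name], params)
    row_count = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Header row taken from the column names of the result set.
        writer.writerow([column[0] for column in cursor.description])
        for row in cursor:
            writer.writerow(row)
            row_count += 1
    return row_count


# Small in-memory stand-in for the real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (trade_date TEXT, symbol TEXT, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [
        ("2016-01-04", "ABC", 10.5),
        ("2016-01-05", "ABC", 10.7),
        ("2016-01-04", "XYZ", 99.0),
    ],
)

out_path = os.path.join(tempfile.mkdtemp(), "trades_by_symbol.csv")
written = run_standard_query(conn, "trades_by_symbol", ("ABC",), out_path)
print(f"wrote {written} rows to {out_path}")
```

Keeping the SQL in a fixed catalogue and passing only parameters means the user never types SQL, which is one possible reading of "a few standard queries".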

How is social media data unstructured data?

I recently began reading up on big data, and how there are tools like Hadoop or BigInsights that can manage both structured and unstructured data.
Social Media Analytics is something that can be done on BigInsights, and it takes unstructured data and analyzes/structures it accordingly.
This got me wondering, how is Social Media Data unstructured? For example, information about tweets can be requested through the Twitter REST API and is returned to you in a structured JSON format.
So isn't Social Media data already structured? If so, why do you need a platform that manages mainly unstructured data?
Some also make the distinction "semi-structured".
But the point is the ability to query the data. Yes, tweets etc. usually have some structure. But it's not helpful for analysis.
Given an ugly SQL schema, you could indeed run a query like
SELECT AVG(TweetID) FROM Twitter;
but that functionality is useless in practice. And that is probably why the data is best considered unstructured: you do not benefit from squeezing it into a relational schema.
Beware of buzzword bingo with big data, though. More often than not, "supports unstructured data" actually means "does not benefit from structure in your data (by using indexes) but rereads the data every time".
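To make the point concrete, here is a small sketch using Python's built-in `json` and `sqlite3` modules. The two tweet records and the `tweets` table layout are invented for illustration and are much simpler than a real Twitter API payload. The structured fields load into a table without trouble, but the only interesting question (what is the tweet saying?) lives in the free-text column, where SQL can do little more than keyword matching.

```python
import json
import sqlite3

# Hypothetical tweet records; real API payloads have many more fields.
raw = [
    '{"id": 101, "author": "alice", "body": "Loving the new phone, battery lasts forever"}',
    '{"id": 102, "author": "bob", "body": "battery died after an hour. great. just great."}',
]
tweets = [json.loads(line) for line in raw]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER, author TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [(t["id"], t["author"], t["body"]) for t in tweets],
)

# Easy to compute from the structured fields, but meaningless.
avg_id = conn.execute("SELECT AVG(id) FROM tweets").fetchone()[0]

# A naive keyword match on the free text counts bob's sarcastic
# tweet as positive, because the schema knows nothing about meaning.
naive_positive = [
    row[0]
    for row in conn.execute(
        "SELECT author FROM tweets "
        "WHERE body LIKE '%great%' OR body LIKE '%loving%' "
        "ORDER BY id"
    )
]
print(avg_id, naive_positive)
```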
It's not only about getting the tweets. The real value of the data lies in knowing what is being said. Consider Facebook, where people can comment on any picture or video. We need a platform that can tell how many comments are positive about the video, how many are disparaging it, how many are genuine feedback, and how many offer suggestions for improving it. You also need to know how many times the video was shared and liked, and who shared it: the people who like it or the people who dislike it. Because so many varieties of data can be collected, it is all called unstructured data.

Hbase Schema design

I have to design an HBase table to store user information. This information is intended for social networking, for example: age, sex, education, hobbies, books read, countries traveled to ...
NOTE: we could add more information in the future; we don't know all of it now.
For example:
name: Olha, age: 25, sex: female, education: bachelor Information technology, education: master computer science, hobby: basket ball, hobby: ping pong, book: gone with the wind, book: Davinci code, language: english, language: french, Country: Germany
The main idea is to be able to do queries like:
return all people who are female, age: 22 years old, speak: english, speak: french, read the book gone with the wind, like ping pong, like basket ball and are from Germany.
So you can add any criteria to the search query.
What HBase table schema (row key, column family ...) would you suggest to optimize this kind of search query, taking into consideration that we will add more information in the future?
What is the best way to write such a query (scan, get, MapReduce)?
Thank you
I would agree with Ian Varley that Solr/Lucene and its faceted queries and joins allow you to pivot the data in the way you want to see it. However, I also think your question might be a "counting" question or a "membership" question.
It sounds like you are after a list of people who match (N) attributes. The problem you have is that for each attribute you could have millions of user ids?
HBase is a good fit when all you are trying to do is compute intersection/union sizes. Your key/value pairs can be put into HBase, and you can "encode" the user IDs into either a Bloom filter or a HyperLogLog, trading some accuracy for speed and memory. You would likely run map/reduce-style jobs hourly or nightly over click-streams or log aggregations of some type.
Others have done this in the advertising space and online space for exactly the type of queries you are running ("find people who like red bull and pop-tarts that live in florida")
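As a rough idea of what "encoding" user IDs into a Bloom filter means, here is a toy sketch in plain Python. The `BloomFilter` class, its sizes and the sample user IDs are all invented for illustration; this is not HBase's built-in Bloom filter support, and a production setup would use a tuned library. It only shows the membership side of the trade-off: the filter never misses an ID that was added, but may occasionally report one that was not.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical: IDs of users who like Red Bull, stored compactly per attribute.
likes_red_bull = BloomFilter()
for user_id in ["u1", "u2", "u3"]:
    likes_red_bull.add(user_id)

# Hypothetical: IDs of users who live in Florida.
lives_in_florida = ["u2", "u3", "u4"]

# Users that are probably in both sets (false positives are possible).
probably_both = [u for u in lives_in_florida if u in likes_red_bull]
print(probably_both)
```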
References
Contextual Advertising using Apache Hive and Amazon EMR: http://aws.amazon.com/articles/2855
Scaling Distributed Counters: http://whynosql.com/scaling-distributed-counters/
Google: Sharding counters: https://developers.google.com/appengine/articles/sharding_counters
Distributed Counter Performance in HBase - Part 1: http://palominodb.com/blog/2012/08/24/distributed-counter-performance-hbase-part-1
Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day: http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Realtime Analytics with Hadoop and HBase: http://www.slideshare.net/larsgeorge/realtime-analytics-with-hadoop-and-hbase
Log Event Processing with HBase: http://tellapart.com/log-event-processing-with-hbase
Clickstream Analytics at BazaarVoice: http://www.slideshare.net/bazaarvoice_engineering/austin-scales-clickstream-analytics
Realtime Analytics with HBase: http://www.slideshare.net/alexbaranau/realtime-analytics-with-hbase-long-version
This isn't a great use of HBase, in the sense that this is exactly the kind of thing that search indexes (like Lucene) are good for.
One normal schema to store users and their information might look a lot like a relational database, in that you'd have 1 row per user, and store all the attributes as columns & values (age=22, language=french, etc). This works well for the extensibility you mention (you don't need to change any schema in order to store new attributes). With this schema, you could look up any one user (and all of their attributes) by the unique user id. That'd be blazingly fast to do, no matter how many users you have.
However, with that schema, if you want to search in the way you describe ("return all users whose age is 22"), every single query is going to end up being a scan of the entire table, because HBase only allows you to access things via their primary key; it does not have secondary indexing of any kind. That will be extremely inefficient (picture having to scan a million rows every time you want to do any single query).
How to fix this? You could "reverse" the ordering of the data and put the values in the row key and then point to all the users with that value. For example, the row key could be "age:22", and then in the columns of the row could be all the userids that are age 22. This is problematic for a lot of reasons, not least of which is that it will be extremely expensive and tricky to make updates. But it would perform well for those specific queries.
The trick? That's exactly what a search index (like Lucene) does, and it does it much better than you could by rolling your own with HBase. That sounds like the tool you want to be using here.
If you must use HBase (as you say, since it's a research project) it might be worth looking into using HBase and Lucene together; google that for pointers.
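The difference between the row-per-user layout and the "reversed" layout described above can be sketched with plain Python dictionaries. This is only an in-memory illustration of the two access patterns, not HBase or Lucene code; the user IDs, attributes and function names are invented.

```python
from collections import defaultdict

# Hypothetical users, one "row" per user, attributes as (name, value) pairs.
users = {
    "u1": {("sex", "female"), ("age", "22"), ("language", "english"), ("language", "french")},
    "u2": {("sex", "female"), ("age", "25"), ("language", "english")},
    "u3": {("sex", "male"), ("age", "22"), ("language", "french")},
}


def full_scan(criteria):
    """Row-per-user layout: every query has to look at every row."""
    return sorted(uid for uid, attrs in users.items() if criteria <= attrs)


# "Reversed" layout: key "attribute:value" -> set of user ids, built once.
index = defaultdict(set)
for uid, attrs in users.items():
    for name, value in attrs:
        index[f"{name}:{value}"].add(uid)


def indexed_lookup(criteria):
    """Intersect the id sets of each criterion; only matching keys are read."""
    postings = [index.get(f"{name}:{value}", set()) for name, value in criteria]
    if not postings:
        return sorted(users)  # no criteria: everyone matches
    return sorted(set.intersection(*postings))


query = {("sex", "female"), ("age", "22"), ("language", "french")}
print(full_scan(query), indexed_lookup(query))
```

Both functions return the same users, but the second one never touches rows that cannot match. The cost, as noted above, is that every attribute change has to update the reversed keys as well, which is the bookkeeping a search index takes care of for you.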

Resources