Hadoop Pig: save each line of a file to S3

Currently I have a Pig script running on Amazon EMR that loads a bunch of files from S3, filters them, and groups the data by phone number, so the resulting schema looks like (phonenumber:chararray, bag:{mydata:chararray}). Next I have to store the data for each phone number into a different S3 bucket (possibly buckets in different accounts that I have access to). org.apache.pig.piggybank.storage.MultiStorage seems like the best fit here, but it doesn't work for me because of two problems:
There are a lot of phone numbers (approximately 20,000); storing each phone number into a different S3 bucket is very slow, and the program even runs out of memory.
There is no way for me to consult my lookup table to decide which bucket to store into.
So I am wondering if anyone can help out. The second problem can probably be solved by writing my own UDF store function, but how do I solve the first one? Thanks.

S3 is limited to 100 buckets per account. Moreover, creating a bucket is not immediate; you need to wait for the bucket to become ready.
However, you can have as many objects as you want in a bucket, and you can write each phone number as a different object relatively quickly, especially if you take care with the names of your objects: objects in S3 are stored by prefix. If you give all your objects the same prefix, S3 will try to put all of them in the same "hot" area and you will get lower performance. If you make the prefixes differ (often by simply reversing the id or timestamp), you will improve throughput significantly.
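As a rough illustration of the object-per-phone-number approach (a minimal sketch in Python/boto3; the bucket name and key layout are assumptions, not part of the original setup):

```python
# Sketch: one S3 object per phone number, with a reversed-number prefix so that
# writes spread across S3 key partitions instead of hitting one "hot" prefix.
import boto3

s3 = boto3.client("s3")

def store_phone_records(bucket, phone_number, records):
    # Reversing the phone number makes consecutive numbers land on different prefixes.
    key = f"{phone_number[::-1]}/{phone_number}.txt"
    body = "\n".join(records).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body)

# Example usage (hypothetical bucket name):
# store_phone_records("my-output-bucket", "15551234567", ["record 1", "record 2"])
```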
You can also take a look at DynamoDB, a scalable NoSQL database on AWS. You can provision very high throughput for the time it takes to build your index, and later export the data to S3 as well, using Hive over Elastic MapReduce.

Related

How should I use Azure Blob Containers?

My current approach is that I have a few containers:
raw (the actual raw files or exports, separated into folders like servicenow-cases, servicenow-users, playvox-evaluations, etc.)
staging (lightly transformed raw data)
analytics (these are Parquet file directories which consolidate and partition the files)
visualization (we use a 3rd party tool which syncs with Azure Blob, but only CSV files currently. This is almost the exact same as the analytics container)
However, it could also make some sense to create more containers and kind of use them like I would use a database schema. For example, one container for ServiceNow data, another for LogMeIn data, another for our telephony system, etc.
Is there any preferred approach?
Based on your description, it seems you are torn between using a small number of containers that each store a large number of blobs, or a large number of containers that each store a small number of blobs. If parallelism and scalability are all you are worried about, you can rest assured and simply design a storage structure that suits you, because partitioning in Azure Blob storage is done at the blob level, not the container level.
Each of these two approaches has its advantages and disadvantages.
With a small number of containers, you save the cost of creating containers (creating a container is a billable operation). But when you list the blobs in a container, all of its objects are listed; if you have further subsets (virtual folders) inside, you have to keep fetching and filtering, so listing performance is worse than with the many-containers approach. At the same time, any security boundary you set applies to all blobs in that container, which is not necessarily what you want.
With a large number of structured containers, you can set more security boundaries (custom access permissions, container-level SAS signatures). Listing blobs is also easier, since there are no messy subsets to dig through. The disadvantage is the extra cost of creating containers (in extreme cases this can add up considerably; in general it does not matter much; a website that calculates costs: https://azure.microsoft.com/en-us/pricing/calculator/?cdn=disable).
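For illustration, here is a minimal sketch (Python, azure-storage-blob; the connection string, container and folder names are placeholders) of listing only one "folder" of blobs inside a single container, which is the pattern the few-containers approach relies on:

```python
# Sketch: list blobs by prefix within a single "raw" container, so virtual
# folders stand in for extra containers.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("raw")

# Only blobs under the servicenow-cases/ "folder" are returned.
for blob in container.list_blobs(name_starts_with="servicenow-cases/"):
    print(blob.name, blob.size)
```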

Advantages of database vs hash for simple key value lookup (in Ruby)

Suppose I have a production app on AWS with, let's say 50,000 users, and I need to simply take a username and lookup one or two pieces of information about them.
Is there any advantage to keeping this information in a DynamoDB over a Ruby hash stored in an AWS S3 bucket?
By "advantages" I mean both cost and speed.
At some point, will I need to migrate to a DB, or will a simple hash lookup suffice? Again, I will never need to compare entries, or do anything but look up the values associated with a key (username).
The more general question is: what are the advantages of a DB (like DynamoDB) over an S3 hash for the purposes of simple key/value storage?
Note that a Hash cannot be used as a database; it has to be loaded with values from some data store (such as a database, or a JSON or YAML file, or equivalent). DynamoDB, on the other hand, is a database and has persistence built in.
Having said that, for 50,000 entries a Ruby Hash should be a viable option; it will perform quite well, as indicated in this article.
A Ruby Hash is not distributed, so if you run your app on multiple servers for availability/scalability you will have to load that Hash on each server and keep its data consistent. In other words, if a user attribute gets updated via one server, you need a way to replicate that change to the other servers. Also, if the number of users in your system is not 50,000 but 50 million, you may have to rethink the Hash-as-cache option.
DynamoDB is a full-blown NoSQL database: it is distributed and promises high scalability. It also costs money to use, so your decision should be based on whether you need the scale and availability DynamoDB offers, and whether you have the budget for it.
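To make the trade-off concrete, here is a minimal sketch (in Python/boto3 for brevity; the Ruby AWS SDK exposes equivalent calls, and the table and attribute names are assumptions) contrasting a DynamoDB key lookup with an in-process hash lookup:

```python
# Sketch: key/value lookup via DynamoDB vs. an in-process hash.
import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")  # hypothetical table keyed on "username"

def lookup_dynamodb(username):
    # One network round trip per lookup; persistence and scaling are handled by the service.
    response = users.get_item(Key={"username": username})
    return response.get("Item")

def lookup_hash(user_hash, username):
    # In-process lookup: very fast, but the hash must be loaded (and kept
    # consistent) on every server that runs the app.
    return user_hash.get(username)
```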

How reliable is ElasticSearch as a primary datastore against factors like write loss, data availability

I am working on a project that requires a generic dashboard where users can do different kinds of grouping, filtering and drill-down on different fields. For this we are looking for a search store that allows slicing and dicing the data.
There would be multiple sources of data, all stored in the search store. Some pre-computation may be required on the source data, which can be done by an intermediate component.
I have looked through several blogs to understand whether ES can be used reliably as a primary datastore; it mostly depends on the use case. Some information about our use case:
Around 300 million records per year, each 1-2 KB.
Assuming we store one year of data, we are at about 300 GB today, but the use case can grow to 400-500 GB as the data grows.
We are not yet sure how we will push the data, but roughly it can reach ~2-3 million records per 5 minutes.
Search volume is low, but queries are complex and may search data from the last 6 weeks to 6 months.
Documents will be indexed across almost all of their fields.
Some blogs say that it is reliable enough to use as a primary data store -
http://chrisberkhout.com/blog/elasticsearch-as-a-primary-data-store/
http://highscalability.com/blog/2014/1/6/how-hipchat-stores-and-indexes-billions-of-messages-using-el.html
https://karussell.wordpress.com/2011/07/13/jetslide-uses-elasticsearch-as-database/
And some blogs say that ES has a few limitations -
https://www.found.no/foundation/elasticsearch-as-nosql/
https://www.found.no/foundation/crash-elasticsearch/
http://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-my-primary-datastore
Has anyone used Elasticsearch as the sole source of truth, without a primary store like PostgreSQL, DynamoDB or RDS? I have read that ES has issues like split brain and index corruption that can lead to data loss. So I would like to know whether anyone has used ES this way and run into any trouble with their data.
Thanks.
Short answer: it depends on your use case, but you probably don't want to use it as a primary store.
Longer answer: You should really understand all of the possible issues that can come up around resiliency and data loss. Elastic has some great documentation of these issues which you should really understand before using it as a primary data store. In addition Aphyr's post on the topic is a good resource.
If you understand the risks you are taking and you believe that those risks are acceptable (e.g. because small data loss is not a problem for your application) then you should feel free to go ahead and try it.
It is generally a good idea to design redundant data storage. For example, a fast and reliable approach is to first push everything as flat data to static storage such as S3, then have ES pull and index the data from there. If you need more flexibility, for example to leverage an ORM, you could put an RDS or Redshift layer in between. This way the data can always be rebuilt in ES.
How you balance redundancy against flexibility and performance depends on your needs and requirements. If there is a lot of data involved, you could store the raw data statically and have ES index only parts of it.
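As a rough sketch of that rebuild path (Python; the bucket, prefix and index names are assumptions, and the raw files are assumed to be JSON lines):

```python
# Sketch: rebuild an Elasticsearch index from raw JSON-lines files kept in S3,
# treating ES as a rebuildable index rather than the only copy of the data.
import json

import boto3
from elasticsearch import Elasticsearch, helpers

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")

def docs_from_s3(bucket, prefix):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                yield {"_index": "events", "_source": json.loads(line)}

helpers.bulk(es, docs_from_s3("my-raw-data-bucket", "events/"))
```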
AWS Lambda offers useful features here:
Many developers store objects in Amazon S3 while using Amazon DynamoDB to store and index the object metadata and enable high speed search. AWS Lambda makes it easy to keep everything in sync by running a function to automatically update the index in Amazon DynamoDB every time objects are added or updated from Amazon S3.
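A minimal sketch of that pattern (Python; the DynamoDB table and attribute names are assumptions) could look like this Lambda handler, triggered by S3 event notifications:

```python
# Sketch: keep a DynamoDB metadata index in sync with objects added to S3.
import boto3

dynamodb = boto3.resource("dynamodb")
index_table = dynamodb.Table("object-index")  # hypothetical table name

def handler(event, context):
    # S3 event notifications deliver one or more records per invocation.
    for record in event["Records"]:
        s3_info = record["s3"]
        index_table.put_item(Item={
            "object_key": s3_info["object"]["key"],
            "bucket": s3_info["bucket"]["name"],
            "size": s3_info["object"].get("size", 0),
            "event_time": record["eventTime"],
        })
```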
Since 2015, when this question was originally posted, many resiliency issues have been found and addressed, and in recent years many features, in particular stability and resiliency features, have been added, so it is definitely something to consider given the right use case and when leveraging the right features in the right way.
So as of 2022, my answer to this question is: yes you can, as long as you do it correctly and for the right use case.

Which solution for GPS analytics (storage and processing) on Heroku/Amazon EMR

Happy new year and best wishes!
We are collecting a large amount of GPS positions for analytics purposes that we would like to store and process (2-3 GB of data daily) using Heroku / Amazon services, and we are looking for a suitable solution. We initially thought about a system where the data is uploaded directly to Amazon S3, a worker dyno constantly processes it and loads the GPS positions into a Heroku PostGIS database, and another worker dyno is used on demand to compute analytics output on the fly. We have also heard about Amazon Elastic MapReduce, which works directly with the raw data in S3 without a PostGIS database. We need your guidance.
What are your recommendations for storing and processing this kind of data (Heroku add-ons, architectures, etc.)? What do you think of the two alternatives listed above?
Many thanks
It is difficult to give a precise answer as the details of your processing are not clear: do you need per-user analytics, per-region analytics, analytics across days, etc.?
I can point you to some related services:
Amazon Kinesis - a newer service targeted at exactly such (Internet-of-Things-like) use cases. You can PUT your readings from various sources (including directly from the mobile devices) and read them on the server side.
Amazon DynamoDB - a NoSQL database for which AWS recently added a geospatial library: http://www.allthingsdistributed.com/2013/09/dynamodb-geospatial.html and http://aws.typepad.com/aws/2013/09/new-geo-library-for-dynamodb-.html
RDS with PostgreSQL - PostgreSQL is very good for GIS calculations, and with RDS it is even easier to manage, as most of the required DBA work (installation, updates, backup, restore, etc.) is handled by the RDS service.
S3 - THE place to store your data for batch processing. Note that larger files work best for most processing cases such as EMR. You can have a connector that reads the data from Kinesis and stores it in S3 (see the GitHub example: https://github.com/awslabs/amazon-kinesis-connectors/tree/master/src/main/java/com/amazonaws/services/kinesis/connectors/s3)
Amazon EMR - the cluster management service that makes running jobs like Hadoop jobs much easier. You can find a presentation about using EMR for geospatial analytics in the re:Invent session BDT201 and its video.
You should also consider pre-processing the data to limit the number of redundant records. Most of your positions are likely to be at the same location; in other words, the device will be sitting still much of the time.
One approach is to store a new position only if its speed is greater than 0, or if its speed is 0 but the last stored position's speed was greater than 0. That way, among resting positions, you store only the first one after the device stops moving. There will be noise in the GPS speed, so you will not get rid of every resting position.
Another option would be to store only when a new position is some distance from the previously stored position.
You can always return a result for any requested time by finding the closest record before the requested timestamp.
If you use the distance-based compression, consider setting the required distance at least as large as the expected RMS error of the GPS device, likely to be about 5 m minimum; use a longer distance if you can stand it.
Doing the math for the distance between geo locations can be resource-expensive; pre-calculate a delta lat/lon threshold to compare against incoming positions to speed that up.
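A minimal sketch of that kind of pre-filtering (Python; the 5 m threshold follows the RMS-error suggestion above, and the constants are approximations):

```python
# Sketch: distance-threshold filtering of incoming GPS fixes, comparing raw
# lat/lon deltas against pre-computed thresholds in degrees so we avoid a full
# haversine calculation for every point.
import math

MIN_DISTANCE_M = 5.0              # roughly the GPS RMS error
METERS_PER_DEG_LAT = 111_320.0    # approximate meters per degree of latitude

def keep_position(prev, new):
    """prev/new are (lat, lon) tuples; prev is the last *stored* position."""
    if prev is None:
        return True
    lat_threshold = MIN_DISTANCE_M / METERS_PER_DEG_LAT
    # Longitude degrees shrink with cos(latitude).
    lon_threshold = MIN_DISTANCE_M / (METERS_PER_DEG_LAT * math.cos(math.radians(prev[0])))
    return (abs(new[0] - prev[0]) > lat_threshold or
            abs(new[1] - prev[1]) > lon_threshold)
```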
EMR has launched a Kinesis connector, so one can process such a dataset using familiar tools from the Hadoop ecosystem. Did you see http://aws.typepad.com/aws/2014/02/process-streaming-data-with-kinesis-and-elastic-mapreduce.html ?

Storage for Write Once Read Many

I have a list of 1 million digits. Every time the user submits an input, I need to match the input against the list.
As such, the list has Write Once Read Many (WORM) characteristics.
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL database, but is it suitable for WORM? (UPDATE: using a VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would take too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question you'll need to think about two things:
Are you trying to minimize storage space, or are you trying to minimize processing time?
Storing the data in memory will give you the fastest processing time, especially if you can optimize the data structure for your most common operation (in this case a lookup) at the cost of memory space. For persistence, you could store the data to a flat file and read it during startup.
SQL databases are great for storing and reading relational data. For instance, names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data: constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant lookups), or write a simple indexed file accessor class that stores the data in a search optimized order (worst case a flat file).
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to worry about file system storage, SQL is probably your best shot. You can optimize your table indexes, but it also adds another external dependency to your project.
EDIT: Seems MySQL have an ARCHIVE Storage Engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), then create it and store it as a binary file, read the binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (extra app dependency, design expertise); lookups are fast, but not as fast as holding the whole list in memory.
If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list to it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick which has a "write protect" switch, but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk with one entry per line. When you need to match the numbers, just open the file, read each line and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore file system cache), this should be fast enough for a once-per-day access pattern.
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work? Just store the numbers in a text file and read them into a hash set on application startup. On my computer, reading in 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
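A minimal sketch of that approach (Python; the file name is a placeholder):

```python
# Sketch: load the list into a set at startup, answer membership queries from memory.
def load_numbers(path="numbers.txt"):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

NUMBERS = load_numbers()

def matches(user_input):
    # O(1) average-case lookup against the in-memory set.
    return user_input.strip() in NUMBERS
```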
