I am new to large-scale data analytics and archiving, so I thought I would ask this question to see if I am looking at things the right way.
Current requirement:
I have a large number of static files in the filesystem: CSV, EML, TXT, JSON
I need to warehouse this data for archiving / legal reasons
I need to provide a unified search facility (this is the MAIN functionality)
Future requirement:
I need to enrich the data files with additional metadata
I need to do analytics on the data
I might need to ingest data from other sources, e.g. from APIs
I would like to come up with a relatively simple solution with the possibility that I can expand it later with additional parts without having to rewrite bits. Ideally I would like to keep each part as a simple service.
As search is currently the KEY requirement and I am experienced with Elasticsearch, I thought I would use ES for distributed search.
I have the following questions:
Should I copy the files from static storage to Hadoop?
Is there any virtue in keeping the data in HBase instead of in individual files?
Is there a way to trigger an event, once a file is added to Hadoop, that indexes the file into Elasticsearch?
Is there perhaps a simpler way to monitor hundreds of folders for new files and push them to Elasticsearch?
I am sure I am overcomplicating this as I am new to this field. Hence I would appreciate some ideas / directions I should explore to do something simple but future proof.
Thanks for looking!
Regards,
Currently I am using Cassandra to store data for my functional use cases (displaying time-series and consolidated data to users). Cassandra is very good at this if you design your data model correctly (query-driven).
Basically, data is ingested from RabbitMQ by Storm and saved to Cassandra.
Lambda architecture is just a design pattern for big-data architectures; it is technology independent, and the layers can be combined:
Cassandra is a database that can be used as both serving layer and batch layer: I'm also using it for analytics with Spark (because the data is already well formatted, as time series, in Cassandra).
As far as I know, one big thing to consider is STORING your raw data before any processing. You need this in order to recover from any problem, including human error (an algorithm bug, a DROP TABLE in PROD, that kind of thing can happen), for future use, and mainly for batch aggregation.
And here I'm facing a choice:
Currently I'm storing it in Cassandra, but I'm considering moving the raw data to HDFS for several reasons: the raw data is "dead", yet it takes up Cassandra tokens and resources (mainly disk space) in the Cassandra cluster.
Can someone help me with that choice?
HDFS makes perfect sense. Some considerations:
Serialization of data - use ORC/Parquet, or Avro if the format is variable
Compression of data - always compress
HDFS does not like too many small files - when streaming, have a job that aggregates them and writes a single large file at a regular interval
Have a good partitioning scheme so you can get to the data you want on HDFS without wasting resources
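To make the small-files and partitioning points concrete, here is a minimal sketch of a compaction step. It runs against a local directory rather than HDFS, and the `year=/month=/day=` layout, the `part-HHMMSS.gz` file name and both function names are my own assumptions for illustration; on a real cluster the same idea would be done with whatever HDFS client or batch job you already use.

```python
import gzip
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable


def partition_dir(root: Path, ts: datetime) -> Path:
    """Build a date-based partition directory, e.g. root/year=2015/month=06/day=04."""
    return root / f"year={ts:%Y}" / f"month={ts:%m}" / f"day={ts:%d}"


def compact(small_files: Iterable[Path], root: Path, ts: datetime) -> Path:
    """Concatenate many small files into one gzip-compressed file in the partition."""
    target_dir = partition_dir(root, ts)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"part-{ts:%H%M%S}.gz"  # hypothetical naming scheme
    with gzip.open(target, "wb") as out:
        for src in sorted(small_files):
            with src.open("rb") as f:
                shutil.copyfileobj(f, out)  # stream, so memory use stays small
    return target
```

A scheduled job could call `compact()` once per interval on whatever the streaming side has dropped since the last run, then delete the inputs once the large file is safely written.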
HDFS is a better idea for binary files. Cassandra is OK for storing metadata such as the locations of the files, but storing the files themselves needs to be modelled really, really well, so most people just give up on Cassandra and complain that it sucks. It can still be done; if you want to do it, there are some examples like:
https://academy.datastax.com/resources/datastax-reference-application-killrvideo
that might help you to get started.
Also, the question is more material for Quora or even http://www.mail-archive.com/user#cassandra.apache.org/ where it has been asked many times.
I am new to Spark; looks awesome!
I have gobs of hourly logfiles from different sources, and wanted to create DStreams from them with a sliding window of ~5 minutes to explore correlations.
I'm just wondering what the best approach to accomplish this might be. Should I chop them up into 5-minute chunks in different directories? How would that naming structure be associated with a particular timeslice across different HDFS directories? Do I implement a filter() method that knows the log record's embedded timestamp?
Suggestions, RTFMs welcomed.
Thanks!
Chris
You can use Apache Kafka as the DStream source and then try the reduceByKeyAndWindow DStream function. It will create a window of the duration you require.
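To show what a sliding-window reduce by key computes, here is a plain-Python sketch with no Spark involved. Note the assumption: it windows on the timestamp embedded in each record, whereas DStream windows are defined over batch arrival time, so treat it as an illustration of the idea rather than of Spark's behaviour. The function name and the sample events are made up.

```python
from collections import Counter, deque


def sliding_counts(events, window=300, slide=60):
    """Yield (window_end, counts) for (timestamp, key) events.

    Each window covers the half-open interval [window_end - window, window_end)
    and a new window is emitted every `slide` seconds.
    """
    events = sorted(events)
    if not events:
        return
    buf = deque()  # events currently inside the window
    i = 0
    end = (events[0][0] // slide + 1) * slide
    last = events[-1][0]
    while True:
        # Pull in events that have entered the window.
        while i < len(events) and events[i][0] < end:
            buf.append(events[i])
            i += 1
        # Drop events that have slid out of the window.
        while buf and buf[0][0] < end - window:
            buf.popleft()
        yield end, Counter(key for _, key in buf)
        if end > last:
            break
        end += slide
```

For example, `list(sliding_counts([(0, "play"), (30, "play"), (70, "pause"), (310, "play")]))` emits one count per minute, each covering the previous five minutes.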
Trying to understand spark streaming windowing
So basically I have apps on different platforms that are sending logging data to my server. It's a node server that essentially accepts a payload of log entries and it saves them to their respective log files (as write stream buffers, so it is fast), and creates a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based on reading the documentation, I think this format is good for something like Hadoop. The way I think this works is: I will store these logs on a server, then run a cron job that periodically moves these files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR. From there, I could query it with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions; that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dump that changes periodically, the approach you laid out makes sense. Using EMRFS, you can directly process the logs from S3 (which you probably know).
As you 'append' new log events to Hive, part files will be produced, so you don't have to concatenate them before loading them into Hive.
(On day 0, the logs are in some delimited form and are loaded into Hive; part files are produced as a result of the various transformations. On subsequent cycles, new events/logs are appended to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data structures/sets and Hive tables and join them, but the joins are going to be slow. So you may want to define fillers/placeholders in your schema.
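The filler/placeholder idea can be sketched in a few lines of Python. This assumes new metrics are only ever appended at the end of the line; the `duration` column is a hypothetical future field, not something from the question. Old lines then still parse, with the missing trailing values coming back empty.

```python
# Column order of the log line; "duration" is a hypothetical field added later.
FIELDS = ["timestamp", "user", "mediatype", "event", "duration"]


def parse_line(line, fields=FIELDS):
    """Split a space-separated log line and pad missing trailing fields with None."""
    values = line.split()
    padded = values + [None] * (len(fields) - len(values))
    return dict(zip(fields, padded))
```

With this, `parse_line("1433421453 bob iPhone play")` still yields a full record, with `duration` set to `None` for entries written before the field existed.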
If you are going to receive streams of logs (lots of small log files/events) and need to run near real time analytics, then have a look at Kinesis.
(Also test drive Impala. It is faster.)
.. my 2c.
As the title says, how do I sort the file? The PC's memory is just 2 GB, but there are ten billion URLs (assume the longest URL is 256 characters).
Your question is a little vague, but I'm assuming:
You have a flat file containing many URLs.
The URLs are delimited somehow, I'm assuming newlines.
You want to create a separate file without duplicates.
Possible solutions:
1. Write code to read each URL in turn from the file and insert it into a relational database. Make the URL the primary key, and any duplicates will be rejected.
2. Build your own index. This is a little more complex. You would need to use something like a disk-based B-tree implementation. Then read each URL and add it to the disk-based B-tree. Again, check for duplicates as you add to the tree.
However, given all the free database systems out there, solution 1 is probably the way to go.
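A minimal sketch of solution 1, using SQLite from the Python standard library as the relational database (my choice for illustration; any database with a primary key would do). The table name `urls` and the function name are assumptions. Because the primary key is backed by an index, reading the rows back with ORDER BY also gives you the sorted order the title asks about.

```python
import sqlite3


def dedupe_urls(lines, db_path=":memory:"):
    """Insert URLs one by one; the primary key silently rejects duplicates."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
    with conn:  # commit once at the end of the load
        conn.executemany(
            "INSERT OR IGNORE INTO urls (url) VALUES (?)",
            ((line.strip(),) for line in lines if line.strip()),
        )
    return conn


def sorted_urls(conn):
    """Stream the unique URLs back in sorted order."""
    for (url,) in conn.execute("SELECT url FROM urls ORDER BY url"):
        yield url
```

For the real file you would pass an open file object as `lines` and a path on disk as `db_path`, so that neither the input nor the index has to fit in 2 GB of memory.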
If you've got a lot of data, then Hadoop either is, or should be, on your radar.
HDFS is used to store huge volumes of data, and there are a lot of tools for querying that data.
Processing data stored in HDFS is efficient because the work is distributed across the cluster. You can use SQL-like tools such as Hive, as well as other tools like Pig.
Yahoo, for example, uses this Big Data technology to process huge amounts of data. Hadoop is also open source.
Refer to http://hadoop.apache.org/ for more.
Currently I am bringing around 10 tables into Hadoop from an EDW (Enterprise Data Warehouse); these tables closely follow a star schema model. I'm using Sqoop to bring all these tables across, resulting in 10 directories containing CSV files.
I'm looking for better ways to store these files before kicking off MR jobs. Should I follow some kind of model or build an aggregate before working on MR jobs? I'm basically looking for ways of storing related data together.
Most things I have found by searching are storing trivial csv files and reading them with opencsv. I'm looking for something a bit more involved and not just for csv files. If moving towards another format works better than csv, then that is no problem.
It boils down to: how best to store a bunch of related data in HDFS to have a good experience with MR?
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data+schema in the same file but is compact and efficient for fast serialization. It gives you versioning facilities, which are useful when bringing in updated data with a different schema. Hive supports it for both reading and writing, and MapReduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
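To give a feel for the versioning point, here is a plain-Python imitation of what happens when data written with an older schema is read with a newer one. It does not use the Avro library; the `Sale` record and its field names are hypothetical, and `resolve()` only mimics the one rule that a field missing from the data takes the default declared in the reader's schema.

```python
# Hypothetical schemas, written in Avro's JSON schema notation.
SCHEMA_V1 = {
    "type": "record",
    "name": "Sale",
    "fields": [
        {"name": "customer_id", "type": "long"},
        {"name": "amount", "type": "double"},
    ],
}
# Version 2 adds a field and gives it a default so that old data stays readable.
SCHEMA_V2 = {
    "type": "record",
    "name": "Sale",
    "fields": SCHEMA_V1["fields"] + [
        {"name": "region", "type": "string", "default": "unknown"},
    ],
}


def resolve(record, reader_schema):
    """Project a record onto the reader's schema, filling in declared defaults."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out
```

Reading an old record with `SCHEMA_V2` fills in `region` with its default, and reading a new record with `SCHEMA_V1` simply drops the extra field.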
Storing these files as CSV is fine, since you will be able to process them as plain text and can also read them through Hive by specifying the delimiter. If you do not like commas, you could change the delimiter to a pipe ("|"); that's what I do most of the time. Also, you generally want large files in Hadoop, but if the data is large enough that you can partition it and each partition is on the order of a few hundred gigabytes, then it would be good to split the files into separate directories based on your partition column.
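Switching the delimiter is easy to do with Python's standard csv module; a minimal sketch follows (the function name is mine). One reason a pipe can be the safer choice is that values containing commas no longer need quoting, which helps when the downstream reader splits on the delimiter without understanding CSV quotes.

```python
import csv


def to_pipe_delimited(src, dst):
    """Re-write comma-separated rows from src as pipe-separated rows in dst."""
    reader = csv.reader(src)  # understands quoted fields such as "Smith, John"
    writer = csv.writer(dst, delimiter="|", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
```

Note that the writer will still quote any value that itself contains a pipe, so it is worth checking the data for that character first.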
Also, it would be a better idea to have most of the columns in a single table than to have many small normalized tables, but that varies depending on your data size. Also make sure that whenever you copy, move or create data, you do all the constraint checks in your application, as it will be difficult to make small changes to the table later on: you will need to rewrite the complete file for even a small change.
Hive's partitioning and bucketing concepts can be used effectively to put similar data together (not on nodes, but in files and folders) based on a particular column. Here are some nice tutorials for Partitioning and Bucketing.
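Roughly, partitioning maps a column value to a folder and bucketing maps a hash of another column to one of a fixed number of files inside that folder. The sketch below only illustrates that layout: the CRC32 hash is a stand-in (Hive uses its own hash function), and the `bucket_NNNNN` file name, the table and column names are all hypothetical.

```python
import zlib


def bucket_for(key: str, num_buckets: int) -> int:
    """Pick a bucket by hashing the key; CRC32 is a stand-in, not Hive's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def storage_path(table, partition_col, partition_val, key, num_buckets):
    """Folder comes from the partition column, file from the bucketed column."""
    bucket = bucket_for(key, num_buckets)
    return f"{table}/{partition_col}={partition_val}/bucket_{bucket:05d}"
```

Rows with the same partition value end up in the same folder, and rows with the same key always land in the same bucket file, which is what keeps related data together.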