What are the best practices for managing a huge schema using express-graphql?

I'm seeking advice. My schema.js file is approaching 1,000 lines and I'm only halfway done. What would be a sensible file structure to split schema.js into?

Related

Big Data - Lambda Architecture and Storing Raw Data

Currently I am using Cassandra to store data for my functional use cases (displaying time-series and consolidated data to users). Cassandra is very good at this if you design your data model correctly (query-driven).
Basically, data are ingested from RabbitMQ by Storm and saved to Cassandra.
Lambda architecture is just a technology-independent design pattern for big-data architectures; the layers can be combined:
Cassandra is a database that can be used as both the serving layer and the batch layer: I'm using it for my analytics with Spark too (because the data are already well formatted, e.g. as time series, in Cassandra).
As far as I know, one huge thing to consider is STORING your raw data before any processing. You need this in order to recover from any problem, human-caused or otherwise (an algorithm bug, a DROP TABLE in production, things like that happen..), for future use, and mainly for batch aggregation.
And here I'm facing a choice:
Currently I'm storing it in Cassandra, but I'm considering switching to storing the raw data in HDFS, for several reasons: raw data are "dead", they consume Cassandra tokens, and they use resources (mainly disk space) in the Cassandra cluster.
Can someone help me with that choice?
HDFS makes perfect sense. Some considerations (a small write sketch follows the list):
Serialization of data - use ORC or Parquet, or Avro if the schema is variable
Compression of data - always compress
HDFS does not like too many small files - if you are streaming, have a job that aggregates and writes a single large file at a regular interval
Have a good partitioning scheme so you can get to the data you want on HDFS without wasting resources
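A minimal sketch of that kind of write, assuming PySpark; the paths and the event_date partition column are placeholders, not taken from the question:

```python
# Minimal sketch: archive raw events to HDFS as compressed, partitioned Parquet.
# Assumes a running Spark cluster; paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-archive").getOrCreate()

raw = spark.read.json("hdfs:///landing/raw_events/")   # many small incoming files

(raw
    .repartition("event_date")                # fewer, larger files per partition
    .write
    .mode("append")
    .option("compression", "snappy")          # always compress
    .partitionBy("event_date")                # partitioning drives pruning later
    .parquet("hdfs:///archive/raw_events/"))
```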
HDFS is a better idea for binary files. Cassandra is OK for storing the locations of the files and similar metadata, but pure files need to be modelled really carefully, so most people just give up on Cassandra and complain that it sucks. It can still be done; if you want to do it, there are examples like:
https://academy.datastax.com/resources/datastax-reference-application-killrvideo
which might help you get started.
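If you do go the Cassandra route for files, the usual approach is a chunked-blob model; here is a minimal sketch assuming the Python cassandra-driver, with the keyspace, table, and column names made up for the example:

```python
# Illustrative chunked-blob model for storing files in Cassandra.
# Assumes `pip install cassandra-driver` and an existing "filestore" keyspace;
# table and column names are made up for the example.
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("filestore")

session.execute("""
    CREATE TABLE IF NOT EXISTS file_chunks (
        file_id  uuid,
        chunk_no int,
        data     blob,
        PRIMARY KEY (file_id, chunk_no)
    )
""")

def store_file(file_id, path, chunk_size=1024 * 1024):
    """Split a file into ~1 MB chunks, one row per chunk, under one partition key."""
    insert = session.prepare(
        "INSERT INTO file_chunks (file_id, chunk_no, data) VALUES (?, ?, ?)")
    with open(path, "rb") as f:
        for chunk_no, chunk in enumerate(iter(lambda: f.read(chunk_size), b"")):
            session.execute(insert, (file_id, chunk_no, chunk))

# e.g. store_file(uuid.uuid4(), "report.pdf")
```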
Also, the question is more suited to Quora or even http://www.mail-archive.com/user#cassandra.apache.org/ ; it has been asked there many times.

Large-scale static file (CSV, TXT, etc.) archiving solution

I am new to large-scale data analytics and archiving, so I thought I would ask this question to see if I am looking at things the right way.
Current requirement:
I have a large number of static files in the filesystem: CSV, EML, TXT, JSON.
I need to warehouse this data for archiving / legal reasons
I need to provide a unified search facility (the MAIN functionality)
Future requirement:
I need to enrich the data file with additional metadata
I need to do analytics on the data
I might need to ingest data from other sources, e.g. via APIs.
I would like to come up with a relatively simple solution with the possibility that I can expand it later with additional parts without having to rewrite bits. Ideally I would like to keep each part as a simple service.
As search is currently the KEY requirement and I am experienced with Elasticsearch, I thought I would use ES for distributed search.
I have the following questions:
Should I copy the files from static storage to Hadoop?
Is there any virtue in keeping the data in HBase instead of in individual files?
Is there a way to trigger an event once a file is added to Hadoop, so that the file gets indexed into Elasticsearch?
Is there perhaps a simpler way to monitor hundreds of folders for new files and push them to Elasticsearch?
I am sure I am overcomplicating this as I am new to this field. Hence I would appreciate some ideas / directions I should explore to do something simple but future proof.
Thanks for looking!
Regards,
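On the last question (watching folders and pushing new files into Elasticsearch), here is a minimal sketch assuming the Python watchdog and elasticsearch packages; the index name and paths are placeholders:

```python
# Minimal folder-watcher that indexes newly created files into Elasticsearch.
# Assumes `pip install watchdog elasticsearch`; paths and index name are illustrative.
import pathlib
from elasticsearch import Elasticsearch
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

es = Elasticsearch("http://localhost:9200")

class IndexNewFiles(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        path = pathlib.Path(event.src_path)
        es.index(index="archive", document={
            "path": str(path),
            "name": path.name,
            "content": path.read_text(errors="ignore"),   # raw text; enrich later
        })

observer = Observer()
# One schedule() call per root folder; recursive=True covers nested folders.
observer.schedule(IndexNewFiles(), "/data/incoming", recursive=True)
observer.start()
observer.join()
```

The same handler could also copy each file into HDFS, keeping the archiving and indexing parts as separate simple services.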

Using Parquet for realtime queries

I'm trying to come up with a solution for realtime queries (maybe within 0.x seconds), and I'm going to use Parquet to store the data. I want to use Presto and an API to query the data.
My question is: since the Parquet files sit in HDFS, where a file is invisible until it is closed, how do I effectively achieve near-realtime query results?
The Parquet files must be closed in HDFS quickly enough for the query tool to see and use them. But that means I can't put much data into each Parquet file, so I end up with either too many small files or results that are not real-time enough. Any better ideas, or is Parquet simply not a good format for realtime solutions?
Thanks for any input!
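One common pattern (a sketch under assumptions, not a recommendation for true sub-second latency) is to land small micro-batches frequently and compact them in the background. This sketch assumes PySpark Structured Streaming with a Kafka source; the paths, topic name, and trigger interval are illustrative:

```python
# Sketch: land small Parquet micro-batches quickly, compact them into larger
# files later. Assumes PySpark with the Spark-Kafka connector on the classpath;
# paths, topic, and trigger interval are illustrative and untuned.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("near-realtime-parquet").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Frequent small files become visible to query engines quickly...
(events.writeStream
    .format("parquet")
    .option("path", "hdfs:///tables/events_hot/")
    .option("checkpointLocation", "hdfs:///checkpoints/events_hot/")
    .trigger(processingTime="5 seconds")
    .start())

# ...and a separate periodic batch job compacts them to avoid the small-file problem.
def compact(src, dst):
    spark.read.parquet(src).coalesce(8).write.mode("append").parquet(dst)
```

For genuinely sub-second results, people usually keep the most recent window of data in a faster store and use Parquet only for the older, compacted data.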

How to sort and remove duplicate URLs (the file contains about ten billion URLs)?

As the title says: how do you sort the file and remove duplicates if your PC's memory is only 2 GB but there are ten billion URLs (assume the longest URL is 256 characters)?
Your question is a little vague, but I'm assuming:
You have a flat file containing many URLs.
The URLs are delimited somehow, I'm assuming newlines.
You want to create a separate file without duplicates.
Possible solutions :
Write code to read each URL in turn from the file and insert it into a relational database. Make the URL the primary key, and any duplicates will be rejected.
Build your own index. This is a little more complex: you would need something like a disk-based B-tree implementation, then read each URL and add it to the B-tree, again checking for duplicates as you add to the tree.
However, given all the free database systems out there, solution 1 is probably the way to go.
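A minimal sketch of solution 1, using SQLite so no database server is needed; the file names are placeholders, and at ten billion rows the database file will be very large, so put it on a disk with plenty of space:

```python
# Sketch of solution 1: stream URLs into an on-disk SQLite table whose primary
# key is the URL, so duplicate inserts are ignored. File names are illustrative.
import sqlite3

con = sqlite3.connect("urls.db")
con.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")

with open("urls.txt", encoding="utf-8", errors="ignore") as f:
    for line in f:
        url = line.strip()
        if url:
            con.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
con.commit()

# Write the de-duplicated URLs back out in sorted order.
with open("urls_sorted_unique.txt", "w", encoding="utf-8") as out:
    for (url,) in con.execute("SELECT url FROM urls ORDER BY url"):
        out.write(url + "\n")
```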
If you've got a lot of data, then Hadoop either is, or should be, on your radar.
HDFS is used to store huge volumes of data, and there are a lot of tools for querying that data.
Data processing on HDFS is effective and fast; you can use SQL-on-Hadoop tools like Hive, as well as other tools like Pig, etc.
Yahoo uses these big-data technologies to process huge amounts of data, and Hadoop is open source.
Refer to http://hadoop.apache.org/ for more.
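For the concrete de-duplication problem above, the Hadoop-style answer boils down to a distributed sort/distinct; a minimal sketch assuming PySpark, with illustrative HDFS paths:

```python
# Sketch: distributed de-duplication and sort of a huge URL file on HDFS.
# Assumes PySpark; paths are illustrative. The cluster handles the external sort.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("url-dedup").getOrCreate()

(spark.read.text("hdfs:///input/urls.txt")   # one string column named "value"
      .distinct()
      .sort("value")
      .write.mode("overwrite")
      .text("hdfs:///output/urls_unique"))
```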

Modeling Data in Hadoop

Currently I am bringing around 10 tables into Hadoop from an EDW (Enterprise Data Warehouse); these tables closely follow a star-schema model. I'm using Sqoop to bring all these tables across, resulting in 10 directories containing CSV files.
I'm looking at better ways to store these files before kicking off MR jobs. Should I follow some kind of model, or build an aggregate, before working on MR jobs? I'm basically looking for ways of storing related data together.
Most of what I have found by searching is about storing trivial CSV files and reading them with opencsv. I'm looking for something a bit more involved, and not just for CSV files. If moving to another format works better than CSV, that is no problem.
It boils down to: how best to store a bunch of related data in HDFS so as to have a good experience with MR?
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data plus schema in the same file, yet is compact and efficient for fast serialization. It gives you versioning facilities, which are useful when bringing in updated data with a different schema. Hive supports reading and writing it, and MapReduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
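A small illustration of the data-plus-schema point, assuming the Python fastavro package; the record schema and file names are made up for the example:

```python
# Illustration: an Avro file carries its schema alongside the data.
# Assumes `pip install fastavro`; schema and field names are made up.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Sale",
    "fields": [
        {"name": "sale_id", "type": "long"},
        {"name": "amount",  "type": "double"},
        {"name": "region",  "type": "string"},
    ],
})

records = [{"sale_id": 1, "amount": 9.99, "region": "EU"}]

with open("sales.avro", "wb") as out:
    writer(out, schema, records)      # the schema is embedded in the file

with open("sales.avro", "rb") as f:
    for rec in reader(f):             # readers recover the schema from the file itself
        print(rec)
```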
Storing these files as CSV is fine, since you will be able to process them with the text input/output formats and can also read them through Hive using a specific delimiter. You could change the delimiter from comma to pipe ("|") if you prefer; that's what I do most of the time. You generally want large files in Hadoop, but if the data is large enough that each partition is on the order of a few hundred gigabytes, it is a good idea to split the files into separate directories based on your partition column.
It is also a better idea to have most of the columns in a single table than to have many small normalized tables, though that depends on your data size. Make sure that whenever you copy, move, or create data, you do all the constraint checks in your applications, as it will be difficult to make small changes to the table later on; you would need to rewrite the complete file for even a small change.
Hive's partitioning and bucketing concepts can be used effectively to put similar data together (not on nodes, but in files and folders) based on a particular column; a small sketch follows. There are some nice tutorials on partitioning and bucketing worth reading.
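A minimal sketch of that idea, assuming PySpark with Hive support enabled; the database, table, path, and column names are illustrative:

```python
# Sketch: load Sqoop-exported CSVs and write them as a partitioned, bucketed
# Hive table. Database/table/column names and paths are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("star-schema-load")
         .enableHiveSupport()
         .getOrCreate())

sales = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("hdfs:///sqoop/sales/"))

(sales.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("sale_date")        # one folder per date value
    .bucketBy(32, "customer_id")     # rows with the same key land in the same file
    .sortBy("customer_id")
    .saveAsTable("warehouse.sales"))
```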
