is it possible for command time on Databricks to increase to almost double without changing the cluster specifications? what could be causing that?

is it possible for command time on Databricks to increase to almost double without changing the cluster specifications? what could be causing that? - performance

to store data in s3 bucket form Databricks, I used to write the following:
df.write.format("delta").mode("overwrite").save("s3://.....")
the code above used to take 3.27 minutes, it takes now 7.37 minutes, using the same
cluster configuration and the same data.

Related

Strategies to upload data to Redis cache

I have 4 millions user data that is looked up by their phone number. Source data is in S3. Search response time is 50 millisec and system need to be available 99.995%.
I am thinking about a nightly job like below
S3 Source Data -> Glue ETL -> CSV file or RDS -> Cache Upload job -> AWS Redis global datastore
I am leaning towards RDS since in case upload job fails for some reason I don't want to restart from beginning. Every row uploaded to Redis will be mark as processed.
I was planning to do an initial load of data from S3 to Redis and then do incremental load from thereafter. The data team has informed me that they can't produce the incremental data, in other words they can't tell me what data changed yesterday.
As a result, every day I will have to empty the Redis cache and reload the entire data. Since this upload will take considerable time, I am wondering how to keep the response time to 50 ms while data is loading.
Will appreciate any ideas. Thanks!

Spark EMR S3 Processing Large No of Files

I have around 15000 files (ORC) present in S3 where each file contain few minutes worth of data and size of each file varies between 300-700MB.
Since recursively looping through a directory present in YYYY/MM/DD/HH24/MIN format is expensive, I am creating a file which contain list of all S3 files for a given day (objects_list.txt) and passing this file as input to spark read API
val file_list = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/objects_list.txt"))
val paths: mutable.Set[String] = mutable.Set[String]()
for (line <- file_list.getLines()) {
if(line.length > 0 && line.contains("part"))
paths.add(line.trim)
}
val eventsDF = spark.read.format("orc").option("spark.sql.orc.filterPushdown","true").load(paths.toSeq: _*)
eventsDF.createOrReplaceTempView("events")
The Size of the cluster is 10 r3.4xlarge machines (workers)(Where Each Node: 120GB RAM and 16 cores) and master is of m3.2xlarge config (
The problem which am facing is, spark read was running endlessly and I see only driver working and rest all Nodes aren't doing anything and am not sure why driver is opening each S3 file for reading, because AFAIK spark works lazily so till an action is called reading shouldn't happen, I think it's listing each file and collecting some metadata associated with it.
But why only Driver is working and rest all Nodes aren't doing anything and how can I make this operation to run in parallel on all worker nodes ?
I have come across these articles https://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 and https://gist.github.com/snowindy/d438cb5256f9331f5eec, but here the entire file contents are being read as an RDD, but my use case is depending on the columns being referred only those blocks/columns of data should be fetched from S3 (columnar access given ORC is my storage) . Files in S3 have around 130 columns but only 20 fields are being referred and processed using dataframe API's
Sample Log Messages:
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=09/min=00/part-r-00199-e4ba7eee-fb98-4d4f-aecc-3f5685ff64a8.zlib.orc' for reading
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=19/min=00/part-r-00023-5e53e661-82ec-4ff1-8f4c-8e9419b2aadc.zlib.orc' for reading
You can see below that only One Executor is running that to driver program on one of the task Nodes(Cluster Mode) and CPU is 0% on rest of the other Nodes(i.e Workers) and even after 3-4 hours of processing, the situation is same given huge number of files have to be processed
Any Pointers on how can I avoid this issue, i.e speed up the load and process ?

There is a solution that can help you based in AWS Glue.
You have a lot of files partitioned in your S3. But you have partitions based in timestamp. So using glue you can use your objects in S3 like "hive tables" in your EMR.
First you need to create a EMR with version 5.8+ and you will be able to see this:
You can set up this checking both options. This will allow to access the AWS Glue Data Catalog.
After this you need to add the your root folder to the AWS Glue Catalog. The fast way to do that is using the Glue Crawler. This tool will crawl your data and will create the catalog as you need.
I will suggest you to take a look here.
After the crawler runs, this will have the metadata of your table in the catalog that you can see at AWS Athena.
In Athena you can check if your data was properly identified by the crawler.
This solution will make your spark works close to a real HDFS. Due to the metadata will be properly in the Data Catalog. And the time you app is taking to find the "indexing" will allow to run the jobs faster.
Working with this here I was able to improve the queries, and working with partitions was much better with glue. So, have a try this probably can help in the performance.

How to make a cached from a finished Spark Job still accessible for the other job?

My project is implement a interaction query for user to discover that data. Like we have a list of columns user can choose then user add to list and press view data. The current data store in Cassandra and we use Spark SQL to query from it.
The Data Flow is we have a raw log after be processed by Spark store into Cassandra. The data is time series with more than 20 columns and 4 metrics. Currently I tested because more than 20 dimensions into cluster keys so write to Cassandra is quite slow.
The idea here is load all data from Cassandra into Spark and cache it in memory. Provide a API to client and run query base on Spark Cache.
But I don't know how to keep that cached data persist. I am try to use spark-job-server they have feature call share object. But not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB RAM. We estimate data to query is about 100 GB.
What I have already tried:
Try to store in Alluxio and load to Spark from that but the time to load is slow because when it load 4GB data Spark need to do 2 things first is read from Alluxio take more than 1 minutes and then store into disk (Spark Shuffle) cost more than 2 or 3 minutes. That mean is over the time we target under 1 minute. We tested 1 job in 8 CPU cores.
Try to store in MemSQL but kind of costly. 1 days it cost 2GB RAM. Not sure the speed is keeping good when we scale.
Try to use Cassandra but Cassandra does not support GROUP BY.
So, what I really want to know is my direction is right or not? What I can change to archive the goal (query like MySQL with a lot of group by, SUM, ORDER BY) return to client by a API.

If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.
So, as you are using Spark JobServer, you can create a long running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because it will be cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context, otherwise Spark will save a large portion of the data on disk, and this would have some impact on performance.
Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.

What's the proper way to log big data to organize and store it with Hadoop, and query it using Hive?

So basically I have apps on different platforms that are sending logging data to my server. It's a node server that essentially accepts a payload of log entries and it saves them to their respective log files (as write stream buffers, so it is fast), and creates a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based off of reading documentation, I think this format is good for something like Hadoop. The way I think this works, is I will store these logs on a server, then run a cron job that periodically moves these files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR. From there, I could query it with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions, that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.

If you have a large volume of log dump that changes periodically, the approach you laid out makes sense. Using EMRFS, you can directly process the logs from S3 (which you probably know).
As you 'append' new log events to Hive, the part files will be produced. So, you dont have to concatenate them ahead of loading them to Hive.
(on day 0, the logs are in some delimited form, loaded to Hive, Part files are produced as a result of various transformations. On subsequent cycles, new events/logs will be appened to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data structures/sets and Hive tables and join them. But the joins are going to be slow. So, you may want to define fillers/placeholders in your schema.
If you are going to receive streams of logs (lots of small log files/events) and need to run near real time analytics, then have a look at Kinesis.
(also test drive Impala. It is faster)
.. my 2c.

Map Reduce - How to plan the data files

I would like to use AWS EMR to query large log files that I will write to S3. I can design the files any way I like. The data is created in a rate of 10K entries/minute.
The logs consist of dozens of data points and I'd like to collect data for very long period of time (years) to compare trends etc.
What are the best practices for creating such files that will be stored on S3 and queried by AWS EMR cluster?
Whats the optimal file sizes ?Should I create separate files for example on hourly basis?
What is the best way to name the files?
Should I place them in daily/hourly buckets or all in the same bucket?
Whats the best way to handle things like adding some data after a while or change in data structure that I use?
Should I compress things for example by leaving out domain names out of urls or keep as much data as possible?
Is there any concept like partitioning (the data is based on 100s of websites so I can use site ids for example). I must be able to query all the data together, or by partitions.
Thanks!

in my opinion you should use a hourly basis bucket to store data in s3 and then use a pipeline to schedule your mr job to clean the data.
once u have clean the data you can keep it to a location in s3 and then you can run a data pipeline on hourly basis on the lag of 1hour with respect to your MR pipeline to put this process data into redshift.
Hence at 3am on a day you will have 3 hour of processed data in s3 and 2 hour processed into redshift dB.
To do this you can have 1 machine dedicated for running pipelines and on that machine you can define you shell script/perl/python or so script to load data to your dB.
You can use AWS bucketing formatter for year,month,date,hour and so on. for e.g.
{format(minusHours(#scheduledStartTime,2),'YYYY')}/mm=#{format(minusHours(#scheduledStartTime,2),'MM')}/dd=#{format(minusHours(#scheduledStartTime,2),'dd')}/hh=#{format(minusHours(#scheduledStartTime,2),'HH')}/*

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio