My problem.
I have 500,000 distinct IP addresses I need to geocode. The geocode lookup table has an ip-from and ip-to range that I have to compare against; it's a table of 1.8 million rows.
So it's basically:
select /*+ MAPJOIN(a) */ *
from ip_address a
cross join ip_lookup b
where a.AddressInt >= b.ip_from and a.AddressInt <= b.ip_to;
On AWS EMR, I'm running a cluster of 10 m1.large instances, and during the cross join phase it gets stuck at 0% for 20 minutes, but here's the funny thing:
Stage-5: number of mappers: 1; number of reducers: 0
Questions:
1) Does anyone have a better idea than a cross join? I don't mind firing up a few (dozen) more instances, but I doubt that will help, and
2) am I REALLY doing a map-side join here, i.e. storing the ip_address table in memory?
Thanks in advance.
I had (kind of) your problem last year.
Since my geocode table fit in RAM, here's what I did:
I wrote a Java class (let's call it GeoCoder) that reads the geocode info from disk into RAM and
does the geocoding in memory.
I added the file geocode.info to the distributed cache (Hive's ADD FILE command does this).
I wrote a UDF that creates (or reuses, if it was already created) a GeoCoder instance in its evaluate method. A Hive UDF can get the local path of a file in the distributed cache via getClass().getClassLoader().getResource("geocode.info").getFile()
Now I have the local path of geocode.info (by now it's an ordinary file) and the rest is history.
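For illustration, here is a minimal sketch of such a UDF, assuming a hypothetical GeoCoder class with a constructor that takes the local file path and a lookup method that returns a location string for a numeric IP; the full version mentioned below was larger:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class GeoCodeUDF extends UDF {
    // Created lazily so each task JVM loads geocode.info into RAM only once.
    private GeoCoder geoCoder;

    public Text evaluate(LongWritable addressInt) {
        if (addressInt == null) {
            return null;
        }
        if (geoCoder == null) {
            // ADD FILE ships geocode.info to the task's working directory,
            // so it is reachable through the classloader as described above.
            String localPath = getClass().getClassLoader()
                    .getResource("geocode.info").getFile();
            geoCoder = new GeoCoder(localPath);              // hypothetical class
        }
        return new Text(geoCoder.lookup(addressInt.get()));  // hypothetical method
    }
}

After CREATE TEMPORARY FUNCTION geocode AS 'GeoCodeUDF', the range join above turns into a plain per-row lookup: SELECT AddressInt, geocode(AddressInt) FROM ip_address.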
This method is probably overkill (about 150 lines of Java), but it worked for me.
Also, I'm assuming that you really do need Hadoop (like I did) for your task.
Geocoding 500,000 IPs could probably be done on a laptop pretty fast.
My project is to implement an interactive query that lets users explore the data. We have a list of columns the user can choose from; the user adds them to a list and presses "view data". The data is currently stored in Cassandra and we use Spark SQL to query it.
The data flow is: raw logs are processed by Spark and stored into Cassandra. The data is a time series with more than 20 columns and 4 metrics. Because I tested with more than 20 dimensions as clustering keys, writes to Cassandra are quite slow.
The idea here is to load all the data from Cassandra into Spark and cache it in memory, then provide an API to the client and run queries against the Spark cache.
But I don't know how to keep that cached data around. I'm trying to use spark-jobserver, which has a feature called shared objects, but I'm not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB RAM. We estimate the data to query at about 100 GB.
What I have already tried:
Storing the data in Alluxio and loading it into Spark from there, but loading is slow: for 4 GB of data Spark has to do two things, first read from Alluxio (more than 1 minute) and then spill to disk for the shuffle (another 2 or 3 minutes). That already exceeds our target of under 1 minute. We tested one job on 8 CPU cores.
Storing the data in MemSQL, but it's kind of costly: one day of data takes about 2 GB of RAM, and I'm not sure the speed will hold up as we scale.
Querying Cassandra directly, but Cassandra does not support GROUP BY.
So, what I really want to know is whether my direction is right or not. What can I change to reach the goal (MySQL-like queries with a lot of GROUP BY, SUM and ORDER BY, returned to the client through an API)?
If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.
So, as you are using Spark JobServer, you can create a long running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because it will be cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context, otherwise Spark will save a large portion of the data on disk, and this would have some impact on performance.
Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.
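As a hedged illustration of this pattern (not JobServer's exact job API), here is a minimal Java sketch against Spark 1.x, loading a Cassandra table through the spark-cassandra-connector data source; the keyspace, table and column names are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CacheCassandraTable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-demo");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load the Cassandra table via the connector's DataSource API.
        DataFrame df = sqlContext.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "my_keyspace")   // placeholder
                .option("table", "my_events")        // placeholder
                .load();

        // Register and cache; any later query on the same context hits the cache.
        df.registerTempTable("events");
        sqlContext.cacheTable("events");

        // The first query materializes the cache; subsequent ones are served from memory.
        DataFrame result = sqlContext.sql(
                "SELECT dim1, SUM(metric1) AS total FROM events GROUP BY dim1");
        result.show();
    }
}

In a JobServer deployment the context (and therefore the SQLContext) would be the long-running one described above rather than one created in main, but the registerTempTable/cacheTable pattern is the same.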
What is the fastest and most efficient way to bulk delete HBase records? The HBase client API or a MapReduce job?
The HBase client API does not allow bulk deletions unless you know the row keys of the cells you want to delete.
The BulkDeleteEndpoint can be leveraged to do bulk deletes based on the results of a scanner.
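For the first case, where the row keys are known or can be collected with a client-side scan, here is a hedged sketch of batched deletes through the plain client API (table name, scan criteria and batch size are placeholders); the coprocessor approach avoids shipping every key back to the client, which is why it scales better:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanAndDelete {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");   // placeholder table name
        Scan scan = new Scan();                        // add filters / key ranges as needed
        ResultScanner scanner = table.getScanner(scan);

        List<Delete> batch = new ArrayList<Delete>();
        for (Result row : scanner) {
            batch.add(new Delete(row.getRow()));
            if (batch.size() >= 10000) {               // flush in chunks to bound client memory
                table.delete(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            table.delete(batch);
        }
        scanner.close();
        table.close();
    }
}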
The fastest and most efficient way for large contiguous datasets is to drop entire regions by deleting their HDFS directories and removing them from the META table. This incurs virtually no IO so it's arguably almost free.
Note, however, that this is not yet available directly through the high-level APIs, so you have to script or code it yourself in order to get it done.
Here's an example, from the HBase mailing lists, of how you could do it using the shell.
1) Close the region from the shell (read up on how this works using shell help; don't do unassign).
2) Once the region is closed, delete the content of the region in HDFS (the region dir name in HDFS is the same as the region's encoded name, the last portion of a region name; check the refguide).
3) After the delete in HDFS, call assign on the region.
Source http://search-hadoop.com/m/YGbbl9ZaSQ2HLT&subj=Re+Delete+a+region+from+hbase
The HBase client API is faster because you perform the operations directly against the database, whereas with MapReduce the work runs as tasks within jobs, which takes more time in my experience.
Beyond that, HBase lets you run specific operations on column families that MapReduce can't.
Currently I am investigating the possibility of using Cassandra in combination with Spark and Tableau for data analysis. However, the performance that I am currently experiencing with this setup is so poor that I cannot imagine using it for production purposes. As I keep reading about how great the performance of the Cassandra + Spark combination is supposed to be, I am obviously doing something wrong, yet I cannot find out what.
My test data:
All data is stored on a single node
Queries are performed on a single table of 50 MB (interval data)
Columns used in the selection criteria have an index on them
My test setup:
MacBook 2015, 1.1 GHz, 8GB memory, SSD, OS X El Capitan
Virtual Box, 4GB memory, Ubuntu 14.04
Single node with DataStax Enterprise 4.8.4:
Apache Cassandra 2.1.12.1046
Apache Spark 1.4.2.2
Spark Connector 1.4.1
Apache Thrift 0.9.3
Hive Connector 0.2.11
Tableau (Connected through ODBC)
Findings:
When a change in Tableau requires loading data from the database, it takes anywhere between 40 seconds and 1.4 minutes to retrieve the data (which is basically unworkable)
When I use Tableau in combination with Oracle instead of Cassandra + Spark, but on the same virtual box, I get the results almost instantaneously
Here is the table definition used for the queries:
CREATE TABLE key.activity (
interval timestamp,
id bigint,
activity_name text,
begin_ts timestamp,
busy_ms bigint,
container_code text,
duration_ms bigint,
end_location_code text,
end_ts timestamp,
pallet_code text,
src_location_code text,
start_location_code text,
success boolean,
tgt_location_code text,
transporter_name text,
PRIMARY KEY (interval, id)
) WITH CLUSTERING ORDER BY (id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"ALL"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX activity_activity_name_idx ON key.activity (activity_name);
CREATE INDEX activity_success_idx ON key.activity (success);
CREATE INDEX activity_transporter_name_idx ON key.activity (transporter_name);
Here is an example of a query produced by Tableau:
INFO 2016-02-10 20:22:21 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Running query 'SELECT CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END AS `calculation_185421691185008640`,
AVG(CAST(`activity`.`busy_ms` AS DOUBLE)) AS `avg_busy_ms_ok`,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT) AS `qr_interval_ok`,
`activity`.`transporter_name` AS `transporter_name`,
YEAR(`activity`.`interval`) AS `yr_interval_ok`
FROM `key`.`activity` `activity`
GROUP BY CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT),
`activity`.`transporter_name`,
YEAR(`activity`.`interval`)'
Here is an example of the statistics for a query that took 52 seconds to complete:
[Screenshot: Spark statistics for a 52-second query]
I've tried playing around with the partition keys as mentioned in other posts, but did not see a significant difference. I've also tried to enable row caching (Cassandra config + table property), but this also did not have any effect (although perhaps I have overlooked something there).
I would have expected at least 10x-20x better performance out of the box, even without fiddling with all these parameters, and I've run out of ideas about what to do.
What am I doing wrong? What performance should I expect?
Answering your questions will not be easy due to the variables you do not define in your post. You mention that the data is stored on one node, which is fine, but you don't describe how you have structured your tables/column families. You also don't mention the Cassandra cache hit ratios. You also have to consider Cassandra compaction: if compaction is running during heavy read/write operations, it will slow things down.
You also appear to have a single SSD, in which case the data directory, the commitlogs and the cache directories sit on the same physical drive. Even though it is not a spinning disk, you will see degraded performance unless you split the data directory from the commitlog/cache directories. I saw a 50% increase in performance by moving the data directory onto its own physical SSD.
Lastly, you're running in a VM on a laptop host, in VirtualBox no less. Your largest bottleneck here is the 1.1 GHz CPU. In my Cassandra environments on VMware, while running medium jobs I see almost 99% CPU use across 4 x 2 cores with 16 GB RAM. My data directories are on SSDs while my commitlogs and cache directories are on a magnetic HDD. I get good performance, but I tuned my environments to get to this point, and I accept the latency my non-production environments provide.
Take a look HERE and try to get a better understanding of how Cassandra should be used and how to achieve better performance out of the box. Distributed systems are just that: distributed, and for a reason. They give you shared resources that you don't have available on a single machine.
Hope this explains a little more about where you're headed.
EDIT
Your table definition looks fine. Are you using the Tableau Spark connector? Your performance problem is likely on the Cassandra/Spark side of things.
Take a look at this article, which describes a compaction-related problem with reading from the cache. Basically, on Cassandra releases prior to 2.1.2, after a compaction you have lost your cache, because Cassandra threw the file (and the cache) away once the compaction finished. Once you start reading, you immediately get a cache miss and Cassandra goes back to disk. This is fixed in releases from 2.1.2 onward. Everything else looks normal with respect to running Spark/Cassandra.
While the query time does seem a little high, there are a few things I see that could cause issues.
I noticed you're using a MacBook. Beautiful computer, but not ideal for Spark. I believe those use the dual-core Intel M processors. If you go to your Spark Master UI, it'll show you the available cores. It might show 4 (to include vCPUs).
The way you are running this query doesn't allow for much parallelism (if any). You basically don't get the advantages of Spark in this case because you're running in an extremely small VM and on a single node (with limited CPUs). Visualization tools haven't really caught up to Spark yet.
One other thing to keep in mind is that Spark is not designed as an ad-hoc query tool. You can think of SparkSQL as an abstraction over proper Spark batch processing. Comparing it to Oracle at this scale won't yield the results you expect. There's a 'minimum' performance threshold that you'll notice with Spark. Once you scale the data and nodes far enough, you'll start to see that time to completion is not linear in the size of the data: as you add more data, the time to process remains relatively flat.
I suggest trying that query in the Spark SQL REPL (dse spark-sql) and seeing if you get similar times. If you do, then you know that's the best you'll get with your current setup. If Tableau is MUCH slower than the REPL, I'd guess it's something on their end at that point.
I am using
HBase 0.92.1-cdh4.1.2, and
Hadoop 2.0.0-cdh4.1.2
I have a mapreduce program that will load data from HDFS to HBase using HFileOutputFormat in cluster mode.
In that MapReduce program I'm using HFileOutputFormat.configureIncrementalLoad() to bulk load an 800,000-record
data set of about 7.3 GB, and it runs fine, but it does not run for a 900,000-record data set of about 8.3 GB.
In the case of the 8.3 GB data, my MapReduce program has 133 maps and one reducer; all maps complete successfully, but the reducer stays in Pending status for a long time. There is nothing wrong with the cluster, since other jobs run fine and this job also runs fine up to 7.3 GB of data.
What could I be doing wrong?
How do I fix this issue?
I ran into the same problem. Looking at the JobTracker logs, I noticed there was not enough free space on any of my nodes for the single reducer to run:
2013-09-15 16:55:19,385 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. Node tracker_slave01.mydomain.com:localhost/127.0.0.1:43455 has 503,777,017,856 bytes free; but we expect reduce input to take 978136413988
This 503 GB refers to the free space available on one of the hard drives of that particular slave ("tracker_slave01.mydomain.com"); the reducer apparently needs to copy all the data to a single drive.
The reason this happens is that your table only has one region while it is brand new. As data is inserted into that region, it will eventually split on its own.
A solution to this is to pre-create your regions when creating your table. The Bulk Loading chapter in the HBase book discusses this and presents two options for doing it. It can also be done via the HBase shell (see create's SPLITS argument, I think). The challenge, though, is defining your splits so that the regions get an even distribution of keys. I've yet to solve this problem perfectly, but here's what I'm doing currently:
// Pre-create the table with 100 regions, splitting the key space evenly
// between int keys 0 and Integer.MAX_VALUE, so the bulk load is not funneled
// through a single region (and hence a single reducer).
HTableDescriptor desc = new HTableDescriptor();
desc.setName(Bytes.toBytes(tableName));
desc.addFamily(new HColumnDescriptor("my_col_fam"));
// createTable(descriptor, startKey, endKey, numRegions)
admin.createTable(desc, Bytes.toBytes(0), Bytes.toBytes(2147483647), 100);
An alternative solution would be to not use configureIncrementalLoad, and instead: 1) just generate your HFiles via MapReduce with no reducers; 2) use the completebulkload feature in hbase.jar to import your records into HBase. Of course, I think this runs into the same problem with regions, so you'll want to create the regions ahead of time too (I think).
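Here is a hedged sketch of step 2 above, driving the completebulkload step programmatically through LoadIncrementalHFiles (the HFile output directory and table name are placeholders; the table should already exist with its regions created):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");          // pre-split beforehand
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        // Moves the HFiles under this directory into the table's regions.
        loader.doBulkLoad(new Path("/user/hadoop/hfile-output"), table);
        table.close();
    }
}

The same step can also be run from the command line with the completebulkload tool shipped in hbase.jar.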
Your job is running with a single reducer, which means 7 GB of data is being processed by a single task.
The main reason for this is that HFileOutputFormat starts reducers that sort and merge the data to be loaded into the HBase table.
Here, the number of reducers equals the number of regions in the HBase table.
Increase the number of regions and you will achieve parallelism in the reducers. :)
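As a hedged sketch of that suggestion, the table can be created (before the bulk load) with explicit split keys so that configureIncrementalLoad() gets one reducer per region; the split key values below are placeholders and should reflect your actual row-key distribution:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("my_table");   // placeholder name
        desc.addFamily(new HColumnDescriptor("my_col_fam"));

        // Four split points give five regions, and therefore five reducers
        // when HFileOutputFormat.configureIncrementalLoad() is used.
        byte[][] splits = new byte[][] {
                Bytes.toBytes("2000000"),
                Bytes.toBytes("4000000"),
                Bytes.toBytes("6000000"),
                Bytes.toBytes("8000000")
        };
        admin.createTable(desc, splits);
    }
}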
You can get more details here:
http://databuzzprd.blogspot.in/2013/11/bulk-load-data-in-hbase-table.html
I have a requirement to parse both Apache access logs and Tomcat logs one after another using MapReduce. A few fields are extracted from the Tomcat log and the rest from the Apache log. I need to merge/map the extracted fields based on the timestamp and export these mapped fields into a traditional relational DB (e.g. MySQL).
I can parse and extract the information using regular expressions or Pig. The challenge I am facing is how to map the extracted information from both logs into a single aggregate format or file, and how to export this data to MySQL.
A few approaches I am thinking of:
1) Write the MapReduce output for the parsed Apache access logs and Tomcat logs into separate files and merge those into a single file (again based on timestamp), then export this data to MySQL.
2) Use HBase or Hive to store the data in table format in Hadoop and export that to MySQL.
3) Directly write the output of MapReduce to MySQL using JDBC.
Which approach would be most viable? Please also suggest any other alternative solutions you know.
It's almost always preferable to have smaller, simpler MR jobs and chain them together than to have large, complex jobs. I think your best option is to go with something like #1. In other words:
1) Process Apache httpd logs into a unified format.
2) Process Tomcat logs into a unified format.
3) Join the output of 1 and 2 using whatever logic makes sense, writing the result into the same format.
4) Export the resulting dataset to your database.
You can probably perform the join and transform (1 and 2) in the same step. Use the map to transform and do a reduce side join.
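A minimal sketch of that map-transform plus reduce-side join, under the assumption that both log types can be keyed by a normalized timestamp string and merged from tagged values; the parsing helper and the "catalina" source marker are placeholders for your real regex/Pig logic:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LogJoin {

    public static class LogMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String raw = line.toString();
            String timestamp = extractTimestamp(raw);                     // your parsing logic here
            String tag = raw.contains("catalina") ? "TOMCAT" : "APACHE";  // placeholder source test
            ctx.write(new Text(timestamp), new Text(tag + "|" + raw));
        }

        private String extractTimestamp(String raw) {
            // Hypothetical helper: real parsing depends on your log formats.
            return raw.split(" ")[0];
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text timestamp, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            String apachePart = "";
            String tomcatPart = "";
            for (Text rec : records) {
                String[] parts = rec.toString().split("\\|", 2);
                if ("APACHE".equals(parts[0])) {
                    apachePart = parts[1];
                } else {
                    tomcatPart = parts[1];
                }
            }
            // One unified record per timestamp; export this output afterwards
            // rather than writing to MySQL from the reducer.
            ctx.write(timestamp, new Text(apachePart + "\t" + tomcatPart));
        }
    }
}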
It doesn't sound like you need or want the overhead of random access, so I wouldn't look at HBase. This isn't its strong point (although you could do it in the random-access sense by looking up each record in HBase by timestamp, seeing if it exists, merging the record in, or simply inserting if it doesn't exist, but this is very slow, comparatively). Hive could be convenient for storing the "unified" result of the two formats, but you'd still have to transform the records into that format.
You absolutely do not want to have the reducers write to MySQL directly. This effectively creates a DDoS attack on the database. Consider a cluster of 10 nodes each running 5 reducers: you'll have 50 concurrent writers to the same table. As you grow the cluster you'll exceed the maximum number of connections very quickly and choke the RDBMS.
All of that said, ask yourself if it makes sense to put this much data into the database if you're considering the full log records. This amount of data is precisely the kind of case Hadoop itself is meant to store and process long term. If you're computing aggregates of this data, by all means, toss them into MySQL.
Hope this helps.