h2o - deep learning Cartesian grid search - same train & validation, same hyperparameters generate different models - h2o

I am a newbie in AI/ML and found H2O recently.
I am experimenting with grid search for deep learning, using Cartesian search since I want to run all the different combinations. I did two runs using the same train and validation files and the same set of hyper-parameters, along with the same grid.train parameters. Both runs generate the same number of models, and each model is built with the same input parameters: "activation adaptive_rate epsilon hidden hidden_dropout_ratios input_dropout_ratio rho".
My observation is that, even though each run uses the same input parameters, the generated models have different logloss, mean per class error, MSE, RMSE, etc.
To rule out further user error, I limited the grid search to a single set of parameters. My findings, with detailed logs, are below.
My question is: how do I guarantee that the generated models are exactly the same, given the same set of parameters and the same train/validation frames?
Train & Validation file format and data
BPS1,BPS2,ZSRTN,PCNT_RTN,PCNT_RTN100,Open,High,Low,Close,Time
58,18 , 3.00 , -0.12 , -12 , 297.2700 , 297.3100 , 297.0800 , 297.1700 , 201907050935
18,20 , 3.00 , -0.11 , -11 , 297.1800 , 297.1900 , 296.9300 , 296.9400 , 201907050940
20,20 , 5.00 , 0.01 , 1 , 296.9400 , 297.2600 , 296.8200 , 297.2150 , 201907050945
20,30 , 5.00 , 0.03 , 3 , 297.2200 , 297.2600 , 297.0400 , 297.0400 , 201907050950
Values
activation = RectifierWithDropout
adaptive_rate = true
epsilon = 1.0E-6
hidden = [200]
hidden_dropout_ratios = [0.1]
input_dropout_ratio = 0.05
rho = 0.9
Python Code
hyper_parameters = {
    "hidden": [[200]],
    "epsilon": 1.0E-6,
    "adaptive_rate": True,
    "activation": ["RectifierWithDropout"],
    "input_dropout_ratio": [0.05],
    "hidden_dropout_ratios": [0.1],
    "rho": [0.9]
}
.....
search_criteria = {"strategy": "Cartesian"}
.....
model_grid = H2OGridSearch(model=H2ODeepLearningEstimator,
                           grid_id=project_name,
                           hyper_params=hyper_parameters,
                           search_criteria=search_criteria)
model_grid.train(x=x,
                 y=response_column,
                 distribution=default_distribution, epochs=10000,
                 training_frame=train, validation_frame=test,
                 score_interval=0, stopping_rounds=5,
                 stopping_tolerance=1e-3,
                 stopping_metric="mean_per_class_error")
Preparation for first run
07-02 09:27:45.989 192.168.123.5:54321 #7248 #75857-26 INFO: Starting gridsearch: estimated size of search space = 1
07-02 09:27:45.990 192.168.123.5:54321 #7248 FJ-1-51 INFO: Due to the grid time limit, changing model max runtime to: 1.7976931348623157E308 secs.
07-02 09:27:45.992 192.168.123.5:54321 #7248 FJ-1-51 INFO: Building H2O DeepLearning model with these parameters:
07-02 09:27:45.992 192.168.123.5:54321 #7248 FJ-1-51 INFO: {"_train":{"name":"py_1_sid_b81b","type":"Key"},"_valid":{"name":"py_2_sid_b81b","type":"Key"},"_nfolds":0,"_keep_cross_validation_models":true,"_keep_cross_validation_predictions":false,"_keep_cross_validation_fold_assignment":false,"_parallelize_cross_validation":true,"_auto_rebalance":true,"_seed":-1,"_fold_assignment":"AUTO","_categorical_encoding":"AUTO","_max_categorical_levels":10,"_distribution":"AUTO","_tweedie_power":1.5,"_quantile_alpha":0.5,"_huber_alpha":0.9,"_ignored_columns":["Close","PCNT_RTN100","High","Low","PCNT_RTN","Time","Open"],"_ignore_const_cols":true,"_weights_column":null,"_offset_column":null,"_fold_column":null,"_check_constant_response":true,"_is_cv_model":false,"_score_each_iteration":false,"_max_runtime_secs":1.7976931348623157E308,"_stopping_rounds":5,"_stopping_metric":"mean_per_class_error","_stopping_tolerance":0.001,"_response_column":"ZSRTN","_balance_classes":false,"_max_after_balance_size":5.0,"_class_sampling_factors":null,"_max_confusion_matrix_size":20,"_checkpoint":null,"_pretrained_autoencoder":null,"_custom_metric_func":null,"_custom_distribution_func":null,"_export_checkpoints_dir":null,"_overwrite_with_best_model":true,"_autoencoder":false,"_use_all_factor_levels":true,"_standardize":true,"_activation":"RectifierWithDropout","_hidden":[200],"_epochs":10000.0,"_train_samples_per_iteration":-2,"_target_ratio_comm_to_comp":0.05,"_adaptive_rate":true,"_rho":0.9,"_epsilon":1.0E-6,"_rate":0.005,"_rate_annealing":1.0E-6,"_rate_decay":1.0,"_momentum_start":0.0,"_momentum_ramp":1000000.0,"_momentum_stable":0.0,"_nesterov_accelerated_gradient":true,"_input_dropout_ratio":0.05,"_hidden_dropout_ratios":[0.1],"_l1":0.0,"_l2":0.0,"_max_w2":3.4028235E38,"_initial_weight_distribution":"UniformAdaptive","_initial_weight_scale":1.0,"_initial_weights":null,"_initial_biases":null,"_loss":"Automatic","_score_interval":0.0,"_score_training_samples":10000,"_score_validation_samples":0,"_score_duty_cycle":0.1,"_classification_stop":0.0,"_regression_stop":1.0E-6,"_quiet_mode":false,"_score_validation_sampling":"Uniform","_diagnostics":true,"_variable_importances":true,"_fast_mode":true,"_force_load_balance":true,"_replicate_training_data":true,"_single_node_mode":false,"_shuffle_training_data":false,"_missing_values_handling":"MeanImputation","_sparse":false,"_col_major":false,"_average_activation":0.0,"_sparsity_beta":0.0,"_max_categorical_features":2147483647,"_reproducible":false,"_export_weights_and_biases":false,"_elastic_averaging":false,"_elastic_averaging_moving_rate":0.9,"_elastic_averaging_regularization":0.001,"_mini_batch_size":1}
07-02 09:27:45.992 192.168.123.5:54321 #7248 FJ-1-51 INFO: Dropping ignored columns: [Close, PCNT_RTN100, High, Low, PCNT_RTN, Time, Open]
07-02 09:27:45.992 192.168.123.5:54321 #7248 FJ-1-51 INFO: Dataset already contains 128 chunks. No need to rebalance.
07-02 09:27:45.993 192.168.123.5:54321 #7248 FJ-1-51 INFO: Starting model DeepLearning__gen_202007020927_m_5_r_2_b_2_pp_0.05_l_1_t_10_model_1
Result of first run
07-02 09:27:51.513 192.168.123.5:54321 #7248 #75857-30 INFO: Hyper-Parameter Search Summary (ordered by increasing logloss):
07-02 09:27:51.513 192.168.123.5:54321 #7248 #75857-30 INFO: activation adaptive_rate epsilon hidden hidden_dropout_ratios input_dropout_ratio rho model_ids logloss
07-02 09:27:51.513 192.168.123.5:54321 #7248 #75857-30 INFO: RectifierWithDropout true 1.0E-6 [200] [0.1] 0.05 0.9 DeepLearning__gen_202007020927_m_5_r_2_b_2_pp_0.05_l_1_t_10_model_1 1.7762588168309075
Preparation for second run
07-02 09:32:49.293 192.168.123.5:54321 #7248 #75857-29 INFO: Starting gridsearch: estimated size of search space = 1
07-02 09:32:49.293 192.168.123.5:54321 #7248 FJ-1-25 INFO: Due to the grid time limit, changing model max runtime to: 1.7976931348623157E308 secs.
07-02 09:32:49.294 192.168.123.5:54321 #7248 FJ-1-25 INFO: Building H2O DeepLearning model with these parameters:
07-02 09:32:49.294 192.168.123.5:54321 #7248 FJ-1-25 INFO: {"_train":{"name":"py_1_sid_aeed","type":"Key"},"_valid":{"name":"py_2_sid_aeed","type":"Key"},"_nfolds":0,"_keep_cross_validation_models":true,"_keep_cross_validation_predictions":false,"_keep_cross_validation_fold_assignment":false,"_parallelize_cross_validation":true,"_auto_rebalance":true,"_seed":-1,"_fold_assignment":"AUTO","_categorical_encoding":"AUTO","_max_categorical_levels":10,"_distribution":"AUTO","_tweedie_power":1.5,"_quantile_alpha":0.5,"_huber_alpha":0.9,"_ignored_columns":["Time","Open","PCNT_RTN","PCNT_RTN100","Low","Close","High"],"_ignore_const_cols":true,"_weights_column":null,"_offset_column":null,"_fold_column":null,"_check_constant_response":true,"_is_cv_model":false,"_score_each_iteration":false,"_max_runtime_secs":1.7976931348623157E308,"_stopping_rounds":5,"_stopping_metric":"mean_per_class_error","_stopping_tolerance":0.001,"_response_column":"ZSRTN","_balance_classes":false,"_max_after_balance_size":5.0,"_class_sampling_factors":null,"_max_confusion_matrix_size":20,"_checkpoint":null,"_pretrained_autoencoder":null,"_custom_metric_func":null,"_custom_distribution_func":null,"_export_checkpoints_dir":null,"_overwrite_with_best_model":true,"_autoencoder":false,"_use_all_factor_levels":true,"_standardize":true,"_activation":"RectifierWithDropout","_hidden":[200],"_epochs":10000.0,"_train_samples_per_iteration":-2,"_target_ratio_comm_to_comp":0.05,"_adaptive_rate":true,"_rho":0.9,"_epsilon":1.0E-6,"_rate":0.005,"_rate_annealing":1.0E-6,"_rate_decay":1.0,"_momentum_start":0.0,"_momentum_ramp":1000000.0,"_momentum_stable":0.0,"_nesterov_accelerated_gradient":true,"_input_dropout_ratio":0.05,"_hidden_dropout_ratios":[0.1],"_l1":0.0,"_l2":0.0,"_max_w2":3.4028235E38,"_initial_weight_distribution":"UniformAdaptive","_initial_weight_scale":1.0,"_initial_weights":null,"_initial_biases":null,"_loss":"Automatic","_score_interval":0.0,"_score_training_samples":10000,"_score_validation_samples":0,"_score_duty_cycle":0.1,"_classification_stop":0.0,"_regression_stop":1.0E-6,"_quiet_mode":false,"_score_validation_sampling":"Uniform","_diagnostics":true,"_variable_importances":true,"_fast_mode":true,"_force_load_balance":true,"_replicate_training_data":true,"_single_node_mode":false,"_shuffle_training_data":false,"_missing_values_handling":"MeanImputation","_sparse":false,"_col_major":false,"_average_activation":0.0,"_sparsity_beta":0.0,"_max_categorical_features":2147483647,"_reproducible":false,"_export_weights_and_biases":false,"_elastic_averaging":false,"_elastic_averaging_moving_rate":0.9,"_elastic_averaging_regularization":0.001,"_mini_batch_size":1}
07-02 09:32:49.295 192.168.123.5:54321 #7248 FJ-1-25 INFO: Dropping ignored columns: [Time, Open, PCNT_RTN, PCNT_RTN100, Low, Close, High]
07-02 09:32:49.295 192.168.123.5:54321 #7248 FJ-1-25 INFO: Dataset already contains 128 chunks. No need to rebalance.
07-02 09:32:49.295 192.168.123.5:54321 #7248 FJ-1-25 INFO: Starting model DeepLearning__gen_202007020932_m_5_r_2_b_2_pp_0.05_l_1_t_10_model_1
Result of second run
07-02 09:32:53.914 192.168.123.5:54321 #7248 #75857-32 INFO: Hyper-Parameter Search Summary (ordered by increasing logloss):
07-02 09:32:53.914 192.168.123.5:54321 #7248 #75857-32 INFO: activation adaptive_rate epsilon hidden hidden_dropout_ratios input_dropout_ratio rho model_ids logloss
07-02 09:32:53.914 192.168.123.5:54321 #7248 #75857-32 INFO: RectifierWithDropout true 1.0E-6 [200] [0.1] 0.05 0.9 DeepLearning__gen_202007020932_m_5_r_2_b_2_pp_0.05_l_1_t_10_model_1 1.7002255980952898

Related

Performance issues with small files in Hive

I was reading an article about how small files degrade the performance of Hive queries.
https://community.hitachivantara.com/community/products-and-solutions/pentaho/blog/2017/11/07/working-with-small-files-in-hadoop-part-1
I understand the first part regarding overloading the NameNode.
However, what he said regarding MapReduce doesn't seem to happen, for either MapReduce or Tez.
When a MapReduce job launches, it schedules one map task per block of
data being processed
I don't see a mapper task created per file. Maybe the reason is that he is referring to version 1 of MapReduce, and a lot has changed since then.
Hive Version: Hive 1.2.1000.2.6.4.0-91
My table:
create table temp.emp_orc_small_files (id int, name string, salary int)
stored as orcfile;
Data:
The following code will create 100 small files, each containing only a few KB of data.
for i in {1..100}; do hive -e "insert into temp.emp_orc_small_files values(${i}, 'test_${i}', `shuf -i 1000-5000 -n 1`);";done
However, I see only one mapper and one reducer task being created for the following query:
[root@sandbox-hdp ~]# hive -e "select max(salary) from temp.emp_orc_small_files"
log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
Logging initialized using configuration in file:/etc/hive/2.6.4.0-91/0/hive-log4j.properties
Query ID = root_20180911200039_9e1361cb-0a5d-45a3-9c98-4aead46905ac
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1536258296893_0257)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 7.36 s
--------------------------------------------------------------------------------
OK
4989
Time taken: 13.643 seconds, Fetched: 1 row(s)
Same result with map-reduce.
hive> set hive.execution.engine=mr;
hive> select max(salary) from temp.emp_orc_small_files;
Query ID = root_20180911200545_c4f63cc6-0ab8-4bed-80fe-b4cb545018f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1536258296893_0259, Tracking URL = http://sandbox-hdp.hortonworks.com:8088/proxy/application_1536258296893_0259/
Kill Command = /usr/hdp/2.6.4.0-91/hadoop/bin/hadoop job -kill job_1536258296893_0259
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-09-11 20:05:57,213 Stage-1 map = 0%, reduce = 0%
2018-09-11 20:06:04,727 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.37 sec
2018-09-11 20:06:12,189 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.36 sec
MapReduce Total cumulative CPU time: 7 seconds 360 msec
Ended Job = job_1536258296893_0259
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.36 sec HDFS Read: 66478 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 360 msec
OK
4989
This is because the following configuration is taking effect:
hive.hadoop.supports.splittable.combineinputformat
From the documentation:
Whether to combine small input files so that fewer mappers are
spawned.
So essentially, Hive can infer that the input is a group of small files smaller than the block size and combine them, reducing the required number of mappers.

Keras predict gives different error than evaluate, loss different from metrics

I have the following problem:
I have an autoencoder in Keras, and train it for a few epochs. The training overview shows a validation MAE of 0.0422 and an MSE of 0.0024.
However, if I then call network.predict and manually calculate the validation errors, I get 0.035 and 0.0024.
One would assume that my manual calculation of the MAE is simply incorrect, but the weird thing is that if I use an identity model (simply outputs what you input) and use that to evaluate the predicted values, the same error value is returned as for my manual calculation. The code looks as follows:
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers
from keras.callbacks import EarlyStopping

input = Input(shape=(X_train.shape[1], ))
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(input)
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
decoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
decoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(decoded)
decoded = Dense(X_train.shape[1], activation='sigmoid')(decoded)
network = Model(input, decoded)
# sgd = SGD(lr=8, decay=1e-6)
# network.compile(loss='mean_squared_error', optimizer='adam')
network.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mse'])
# Fitting the data
network.fit(X_train, X_train, epochs=2, batch_size=1, shuffle=True, validation_data=(X_valid, X_valid),
            callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=20, verbose=0, mode='auto')])
# Results: evaluate the network directly on the validation data
recon_valid = network.predict(X_valid, batch_size=1)
score2 = network.evaluate(X_valid, X_valid, batch_size=1, verbose=0)
print('Network evaluate result: mae={}, mse={}'.format(*score2))
# Identity model: outputs its input unchanged, used to re-score the predictions
x = Input((X_train.shape[1],))
m = Model(x, x)
m.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mse'])
score1 = m.evaluate(recon_valid, X_valid, batch_size=1, verbose=0)
print('Identity evaluate result: mae={}, mse={}'.format(*score1))
# Manual error computation on the predictions
errors_test = np.absolute(X_valid - recon_valid)
print("Manual MAE: {}".format(np.average(errors_test)))
errors_test = np.square(X_valid - recon_valid)
print("Manual MSE: {}".format(np.average(errors_test)))
Which outputs the following:
Train on 282 samples, validate on 94 samples
Epoch 1/2
2018-04-18 17:24:01.464947: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
282/282 [==============================] - 0s - loss: 0.0861 - mean_squared_error: 0.0187 - val_loss: 0.0451 - val_mean_squared_error: 0.0025
Epoch 2/2
282/282 [==============================] - 0s - loss: 0.0440 - mean_squared_error: 0.0025 - val_loss: 0.0422 - val_mean_squared_error: 0.0024
Network evaluate result: mae=0.04216482736011769, mse=0.0024067993242382767
Identity evaluate result: mae=0.03506102238563781, mse=0.0024067993242382767
Manual MAE: 0.03506102412939072
Manual MSE: 0.002406799467280507
I know that my manual calculation is correct, since the identity model (m) returns the same value. The only possible explanation for the difference in MAE values would then be if network.evaluate(X_valid, X_valid) somehow uses different values than those returned by network.predict(X_valid), but then the MSE would also be different.
This leaves me completely confused, thinking there might be a bug in the Keras MAE calculation. Has anyone had this issue before or have any ideas how it might be fixed? I am using the Tensorflow backend.
Any help would be much appreciated!
EDIT: I'm almost certain this is a bug. If I keep loss='mae' but also add metrics=['mse', 'mae'], the MAE returned by the metrics is the same as my manual computation and the identity model. The same is true for MSE: if I set loss='mse', the MSE returned by the metric is different from the loss.
It turns out that the loss is supposed to be different from the metric, because of the regularization. With regularization the loss is higher (in my case), because the activity regularizer adds a penalty on the layer activations to the loss. The metrics don't take this penalty into account, and therefore return a different value, which equals what one gets when computing the error manually.
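For illustration, a small self-contained sketch of the same effect, using hypothetical random data rather than the question's dataset: the compiled loss includes the activity-regularization penalty, while an 'mae' metric computed on the same model does not, so the two differ by roughly the penalty term.
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

X = np.random.rand(64, 10)  # hypothetical data, only to demonstrate the effect

inp = Input(shape=(10,))
hidden = Dense(5, activation='relu', activity_regularizer=regularizers.l1(1e-2))(inp)
out = Dense(10, activation='sigmoid')(hidden)
model = Model(inp, out)

# 'mae' as a metric is the plain reconstruction error; the loss is that same
# error plus the L1 activity penalty added by the regularizer.
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mae', 'mse'])
loss, mae, mse = model.evaluate(X, X, batch_size=1, verbose=0)
print(loss - mae)  # roughly the activity-regularization penalty (non-negative)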
The metrics during training and validation are different for several reasons:
The dataset is different.
During training the weights change at every step, so the metrics change too.
The metric reported during training is for the current batch of data, or a running average over the last batches; for the evaluation, the metric is computed over the whole dataset (see the toy sketch below).
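A toy illustration of that last point, with purely hypothetical numbers: the training-log value is a running average over batches whose weights keep improving, so it ends up higher than the error of the final weights that evaluate() measures on the whole set.
import numpy as np

# Hypothetical per-batch MAE values observed during one epoch as the weights improve.
per_batch_mae = np.array([0.09, 0.06, 0.04, 0.03])
print(per_batch_mae.mean())  # 0.055: roughly what the epoch's progress bar would show
print(per_batch_mae[-1])     # 0.03: closer to what evaluate() with the final weights reports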

Cassandra - enabled row cache, then got a lot of "GC Pauses greater than 200 ms"

I have a table with about 1 million records and mostly reads (over 95%). Table schema:
CREATE TABLE ams_table (
projectid text,
tagk text,
tagv text,
metricid bigint,
PRIMARY KEY (projectid, tagk, tagv, metricid)
) WITH CLUSTERING ORDER BY (tagk ASC, tagv ASC, metricid ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Reads of this table are very frequent (they account for about half of all requests and 70% of the total query time), causing high CPU usage, so I thought it was a perfect case for the row cache.
One record has 4 fields and is less than 100 bytes, so I figured the entire table would occupy no more than 100 MB of memory.
So I altered the table, setting rows_per_partition to 'ALL' to cache all records:
ALTER TABLE ams.tbl_tags_with_metricid WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL' } ;
And modified cassandra.yaml, setting row_cache_size_in_mb to 128 MB:
row_cache_size_in_mb : 128
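As a rough back-of-envelope check of the sizing assumption above (a sketch using the numbers from the question; it estimates raw data volume only, not the serialized row-cache footprint or per-entry overhead):
records = 1000000          # about 1 million records
bytes_per_record = 100     # 4 fields, under 100 bytes per record
estimate_mb = records * bytes_per_record / (1024.0 ** 2)
print(round(estimate_mb, 1))  # ~95.4 MB, which is under the 128 MB row_cache_size_in_mb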
nodetool info shows a hit rate of almost 0.945; up to this point, everything was fine.
But soon I saw that the CPU usage was higher.
Checking system.log, I get a lot of "GC Pauses greater than 200 ms", like this:
WARN [Native-Transport-Requests-10] 2017-11-22 08:55:39,793 SelectStatement.java:377 - Aggregation query used without partition key
INFO [Service Thread] 2017-11-22 08:55:40,553 GCInspector.java:284 - ParNew GC in 204ms. CMS Old Gen: 11955784424 -> 12032443672; Par Eden Space: 671088640 -> 0; Par Survivor Space: 77459760 -> 83886080
INFO [Service Thread] 2017-11-22 08:55:43,551 GCInspector.java:284 - ParNew GC in 213ms. CMS Old Gen: 12264670920 -> 12341517200; Par Eden Space: 671088640 -> 0;
INFO [Service Thread] 2017-11-22 08:55:45,673 GCInspector.java:284 - ParNew GC in 221ms. CMS Old Gen: 12354581144 -> 12432569616; Par Eden Space: 671088640 -> 0;
INFO [Service Thread] 2017-11-22 08:55:46,186 GCInspector.java:284 - ParNew GC in 217ms. CMS Old Gen: 12248728592 -> 12343417080; Par Eden Space: 671088640 -> 0;
INFO [Service Thread] 2017-11-22 08:55:47,799 GCInspector.java:284 - ParNew GC in 354ms. CMS Old Gen: 11967866640 -> 12058730544; Par Eden Space: 671088640 -> 0;
INFO [Service Thread] 2017-11-22 08:55:48,242 GCInspector.java:284 - ParNew GC in 204ms. CMS Old Gen: 11940028704 -> 11987653704; Par Eden Space: 671088640 -> 0;
Cassandra 3.9, deployed in a Kubernetes container (8 CPU cores, 32 GB memory), with about 10 GB of free memory.
I then tried altering the table with 'rows_per_partition' set to '3000'; the "GC Pauses greater than 200 ms" disappeared, but the hit rate got lower and lower and the CPU usage went back to what it was before.

Pig Latin distinguishing Map or Reduce queries

I have the following data sample:
AGE,EDU,SEX,SALARY
67,10th,Male,<=50K
17,10th,Female,<=50K
40,Assoc-voc,Male,>50K
35,Assoc-voc,Male,<=50K
57,Assoc-voc,Male,<=50K
49,Assoc-voc,Male,>50K
42,Bachelors,Male,>50K
30,Bachelors,Male,>50K
23,Bachelors,Female,<=50K
==============================================
I created the following Pig Latin/Hadoop script:
sensitive = LOAD '/mdsba' using PigStorage(',') as (AGE,EDU,SEX,SALARY);
-- Filter the data by the salary
Data_filter1 = FILTER sensitive by (SALARY matches '<=50K');
Data_filter2 = FILTER sensitive by (SALARY matches '>50K');
--group both filters
B= foreach(group Data_filter1 by(AGE,EDU,SEX))
generate Data_filter1;
C= foreach(group Data_filter2 by(AGE,EDU,SEX))
generate Data_filter2;
Dump B ;
Dump C ;
=============================================================
Is there any way to determine whether the queries B, C, Data_filter1, or Data_filter2 run in the map or the reduce phase? The following report is generated at the end of the job:
Elapsed: 35sec
Diagnostics:
Average Map Time: 12sec
Average Shuffle Time: 10sec
Average Merge Time: 0sec
Average Reduce Time: 2sec
With many thanks
Yes, when you launch the job you'll see a string like:
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: Alias1[73,14] C: Alias2[20, 9] R: Alias3[90, 78]
M stands for mapper, C for combiner, R for reducer. But in the general case it is possible that your queries will run on both the mapper and the reducer.

Caching not Working in Cassandra

I don't seem to have any caching enabled when checking in OpsCenter or cfstats. I'm running Cassandra 1.1.7 with Solandra on Debian. I have set the required global options in cassandra.yaml:
key_cache_size_in_mb: 800
key_cache_save_period: 14400
row_cache_size_in_mb: 800
row_cache_save_period: 15400
row_cache_provider: SerializingCacheProvider
Column Families were created as follows:
create column family example
with column_type = 'Standard'
and comparator = 'BytesType'
and default_validation_class = 'BytesType'
and key_validation_class = 'BytesType'
and read_repair_chance = 1.0
and dclocal_read_repair_chance = 0.0
and gc_grace = 864000
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'ALL';
OpsCenter shows no data available on the caching graphs, and cfstats doesn't show any cache-related fields:
Column Family: charsets
SSTable count: 1
Space used (live): 5558
Space used (total): 5558
Number of Keys (estimate): 128
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 61381
Read Latency: 0.123 ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Postives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 16
Compacted row minimum size: 1917
Compacted row maximum size: 2299
Compacted row mean size: 2299
Any help or suggestions are appreciated.
Sam
The caching stats have been moved from cfstats to info in Cassandra 1.1. If you run nodetool info you should see something like:
Key Cache : size 5552 (bytes), capacity 838860800 (bytes), 38 hits, 47 requests, 0.809 recent hit rate, 14400 save period in seconds
Row Cache : size 0 (bytes), capacity 838860800 (bytes), 0 hits, 0 requests, NaN recent hit rate, 15400 save period in seconds
This is because there are now global caches, rather than per-CF caches. It seems that OpsCenter needs updating for this change; maybe there is a later version available that will work.
