Why is hbase returning row keys that should have all cells expired - hadoop

We have a table with a TTL of 30 days. There are no other TTLs defined for individual columns or cells in the table.
hbase(main):001:0> desc 'server_based_data'
Table server_based_data is ENABLED
server_based_data
COLUMN FAMILIES DESCRIPTION
{NAME => 'raw_data', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', TTL => '2592000 SECONDS (30 DAYS)'
, KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
However, when I run a scan on it, it seems to be running into row keys whose cells should all have been deleted due to the TTL expiring. I know this because the timestamp is part of the key, in the form hash_servername_timestamp, e.g. 65_app129041.iad1.mydomain.com_1476641940.
1476641940 corresponds to Oct 16, 2016, yet I performed the scan for 6/26/2017 from 3 pm to 4 pm.
This is the snippet of my scan object/code to get the results to process.
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("raw_data"), Bytes.toBytes(fileType));
scan.setCaching(500);
scan.setCacheBlocks(false);
scan.setTimeRange(start, end);
TableMapReduceUtil.initTableMapperJob(tableName, scan, MTTRMapper.class, Text.class, IntWritable.class, job);
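Since the actual start and end values aren't shown, here is an illustrative sketch (the timezone is an assumption, matching the PDT timestamps in the log below) of how that 6/26/2017 3 pm to 4 pm window compares with the timestamp embedded in the example row key:
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Illustrative only: the real start/end values used by the job are not shown above.
ZoneId zone = ZoneId.of("America/Los_Angeles");  // assumed PDT, as in the log below
long start = LocalDateTime.of(2017, 6, 26, 15, 0).atZone(zone).toInstant().toEpochMilli(); // 3 pm, in epoch millis
long end = LocalDateTime.of(2017, 6, 26, 16, 0).atZone(zone).toInstant().toEpochMilli();   // 4 pm, in epoch millis
// The suffix of the row key is epoch *seconds*:
Instant rowKeyTime = Instant.ofEpochSecond(1476641940L); // 2016-10-16, far outside [start, end)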
My job fails (I think) because too many map tasks fail or are killed. Below is a snippet from the log of one of the failed map tasks.
Error: org.apache.hadoop.hbase.client.RetriesExhaustedException:
Failed after attempts=36, exceptions: Wed Jun 28 13:46:57 PDT 2017,
null, java.net.SocketTimeoutException: callTimeout=120000,
callDuration=120301: row '65_app129041.iad1.mydomain.com_1476641940'
on table 'server_based_data' at region=server_based_data
I tried manually getting the contents of a few of the rows, but nothing is returned (as expected):
get 'server_based_data', '65_app129041.iad1.mydomain.com_1476641940'
COLUMN CELL
0 row(s) in 0.2900 seconds
My questions are:
1. Why are these keys showing up in the scan? They shouldn't, because they don't fall between the start and end times and because the TTL has expired.
2. How do I debug where these keys are coming from, given that getting the row key directly from the table returns 0 rows?
3. The problem rows aren't the same from test run to test run. I ran my program several times back to back, and each time the problem rows were different. Some are the same between runs, but they are not 100% identical.

Related

Laravel Horizon Restrictions and Optimization

Is there any rule of thumb or any logical relation between maxProcesses, number of supervisors and the total number of queues in laravel horizon?
What if I have 15 supervisors and 40 queues (each supervisor has multiple queues based on their category)? What is the maximum number of maxProcesses I can assign to each supervisor (assuming balance is set to auto)?
I want to know whether there's a rule of thumb for better Horizon performance when tuning these numbers, for example whether the number of supervisors should not exceed the total number of queues, or whether maxProcesses should not exceed a certain number based on the specs of the machine running the processes.
Is there any logical relation between these numbers? Is there a good document about this? I have seen this document on supervisor and also the Laravel Horizon docs, but have not found the answer to my questions.
I need to explain a few things in detail in order to make the relation between all of these settings clear.
Supervisor consists of some simple settings. The most important ones are these:
[program:laravel-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /home/forge/app.com/artisan queue:work
autostart=true
autorestart=true
numprocs=8
The most important setting here is numprocs=8; the Supervisor manual says:
Supervisor will start as many instances of this program as named by numprocs. Note that if numprocs > 1, the process_name expression must include %(process_num)s (or any other valid Python string expression that includes process_num) within it.
This configuration of supervisor running a program called artisan queue:work will create 8 instances (processes, workers, the same thing) of artisan queue:work. This means that 8 jobs can be processed simultaneously, nothing more, nothing less.
Horizon doesn't define numprocs; the only important setting you'll have to know about is stopwaitsecs=3600. This should always be far greater than the maximum time any job in your application runs. Here the absolute maximum would be 60 minutes.
Now Horizon comes with a balancing strategy where you can define the minimum and maximum number of processes (workers) and its strategy, using:
'balance' => 'auto',
'minProcesses' => 1,
'maxProcesses' => 10,
What Horizon does here is scale the number of processes (workers) up or down according to the amount of work present in the queue(s).
If you define a supervisor configuration like the following:
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue' => ['default', 'events', 'xls', 'whatever'],
            'balance' => 'auto',
            'minProcesses' => 10,
            'maxProcesses' => 40,
            'balanceMaxShift' => 1,
            'balanceCooldown' => 3,
            'tries' => 3,
        ],
    ],
],
Then all 4 queues (default, events, xls and whatever) run under the same conditions and have a total of 40 workers available, with a minimum of 10. So it's not that each queue has 40 workers available; all of them combined have 40 workers (processes) available.
The key to getting a good scale so that each queue works optimally is to divide the queues into different categories, e.g.:
short-load -> each job takes about 1 to 5 seconds.
medium-load -> each job takes about 5 to 30 seconds.
long-load -> each job takes up to 5 minutes.
extreme-load -> each job takes longer than 5 minutes, up to an hour.
If you only end up with two scenarios, like short-load and long-load, then you will have two Horizon supervisor configurations, which define how fast new workers are spawned and how many times a job is retried if it fails (you seriously don't want to retry a job three times when it fails after 59 minutes every time):
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue' => ['default', 'events'],
            'balance' => 'auto',
            'minProcesses' => 10,
            'maxProcesses' => 40,
            'balanceMaxShift' => 10,
            'balanceCooldown' => 1,
            'tries' => 3,
        ],
        'supervisor-long-run' => [
            'connection' => 'redis',
            'queue' => ['xls', 'whatever'],
            'balance' => 'auto',
            'minProcesses' => 1,
            'maxProcesses' => 10,
            'balanceMaxShift' => 1,
            'balanceCooldown' => 3,
            'tries' => 1,
        ],
    ],
],
In one of your last comments you asked
I want to understand all those calculations you make, what's the formula for it
The idea is this: one supervisor entry can serve many queues, and all of those queues share a maximum number of workers. The queues themselves are not that important; the number of jobs (and the kind of jobs) placed on these queues in a certain amount of time is.
Example:
4 queues receiving 120 jobs each minute need x workers to process them. If you scale the number of workers (processes) up or down, the time it takes to process all these jobs until the queues are empty changes accordingly.
If you have 10 workers available, then 10 jobs will be processed simultaneously.
If you have 120 workers available, then 120 jobs will be processed simultaneously.
Say 1 job takes 10 seconds to complete (as an example average) and an average of 120 jobs are put on a queue each minute. If you would like to process (clear the queue) all jobs within one minute, you need 120 jobs * 10 seconds per job / 60 seconds in a minute = 20 workers (processes) to complete all those jobs within 1 minute.
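As a tiny sketch of that arithmetic in code (plain Java, just to make the calculation explicit; the numbers are the example figures above):
// Workers needed to clear the queue within one minute, using the example figures above.
int jobsPerMinute = 120;
int secondsPerJob = 10;
int workersNeeded = (int) Math.ceil(jobsPerMinute * secondsPerJob / 60.0); // 120 * 10 / 60 = 20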
So yes, you can scale the number of workers up to 64, 512 or 24890. It all comes back to the question of how much load your hardware can handle.
Hope it made sense.
I'll clean up the text tomorrow using only workers, processes or instances .. it's such a mess ;)

logstash schedules - is it possible to splay the time so schedule does not start on second 1 of every minute?

A simple question here with maybe a complex answer: I have several Logstash Docker containers running on the same host using the JDBC input plugin. Each of them does work every minute. For example:
input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/bin/mysql-connector-java-8.0.15.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    # useCursorFetch needed cause jdbc_fetch_size not working??
    # https://discuss.elastic.co/t/logstash-jdbc-plugin/84874/2
    # https://stackoverflow.com/a/10772407
    jdbc_connection_string => "jdbc:mysql://${CP_LS_SQL_HOST}:${CP_LS_SQL_PORT}/${CP_LS_SQL_DB}?useCursorFetch=true&autoReconnect=true&failOverReadOnly=false&maxReconnects=10"
    statement => "select * from view_elastic_popularity_scores_all where updated_at > :sql_last_value"
    jdbc_user => "${CP_LS_SQL_USER}"
    jdbc_password => "${CP_LS_SQL_PASSWORD}"
    jdbc_fetch_size => "${CP_LS_FETCH_SIZE}"
    last_run_metadata_path => "/usr/share/logstash/codepen/last_run_files/last_run_popularity_scores_live"
    jdbc_page_size => '10000'
    use_column_value => true
    tracking_column => 'updated_at'
    tracking_column_type => 'timestamp'
    schedule => "* * * * *"
  }
}
Notice the schedule is * * * * *? That's the crux. I have a box that's generally idle for 50 seconds out of every minute, then it's working its ass off for x seconds to process data for all 10 logstash containers. What'd be amazing is if I could find a way to splay the time so that the 10 containers work on independent schedules, offset by x seconds.
Is this just a dream? Like world peace, or time away from my kids?
Thanks
I believe rufus cronlines (which is what the schedule option is) can specify seconds.
'13 0 22 * * 1-5' has six fields, so the first field is seconds: it means Monday through Friday at 22:00:13.
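Assuming the jdbc input hands that schedule string to rufus-scheduler unchanged (so the six-field, seconds-first form is accepted), one way to splay the containers would be to give each one a different seconds value, for example:
schedule => "0 * * * * *"   # container 1: fires at second 0 of every minute
schedule => "15 * * * * *"  # container 2: fires at second 15
schedule => "30 * * * * *"  # container 3: fires at second 30, and so on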

logstash cron schedule to run every 12 hours starting at certain time

I am trying the 0 1/12 * * * cron expression, but it only fires once a day. Below is one of my configurations.
input {
  jdbc {
    jdbc_connection_string => "jdbc:redshift://xxx.us-west-2.redshift.amazonaws.com:5439/xxx"
    jdbc_user => "xxx"
    jdbc_password => "xxx"
    jdbc_validate_connection => true
    jdbc_driver_library => "/mnt/logstash-6.0.0/RedshiftJDBC42-1.2.10.1009.jar"
    jdbc_driver_class => "com.amazon.redshift.jdbc42.Driver"
    schedule => "0 1/12 * * *" #01:00,13:00, tried from https://crontab.guru/#0_1/12_*_*_*
    statement_filepath => "conf/log_event_query.sql"
    use_column_value => true
    tracking_column => dw_insert_dt
    last_run_metadata_path => "metadata/logstash_jdbc_last_run_log_event"
  }
}
output {
  elasticsearch {
    index => "logs-ics_%{+dd_MM_YYYY}"
    document_type => "log_event"
    document_id => "%{log_entry_id}"
    hosts => [ "x.x.x.x:xxxx" ]
  }
}
I also tried 0 0 1/12 ? * * * from https://www.freeformatter.com/cron-expression-generator-quartz.html, but Logstash does not support this type of expression.
Original cron used.
Please help me get a cron expression that works in Logstash for the following schedule. Also, is there an online page where I can test my future Logstash cron expressions?
1st at 2018-08-01 01:00:00
then at 2018-08-01 13:00:00
then at 2018-08-02 01:00:00
then at 2018-08-02 13:00:00
then at 2018-08-03 01:00:00
It looks like your scheduling format is wrong.
To do a once-every-twelve hours task, you would use */12, not 1/12:
0 */12 * * * # Every twelve hours at minute 0 of the hour.
Your schedule looks more like an attempt to run the task at 01:00 and 13:00; to do that, you would list the hours with a comma, like this:
0 1,13 * * * # Run at 01:00 and 13:00, at minute 0.
The rufus cronline format also allows adding a timezone (like Asia/Kuala_Lumpur) if you need the schedule to run in a specific timezone rather than the machine's default clock.
Your code above doesn't show us the SQL query you are running. The schedule could be firing, but if the query returns no results, you aren't going to get any input in Logstash. In any case, your scheduling syntax needs to change from 1/12 to */12 (or 1,13 if you specifically want 01:00 and 13:00) to do what you want.
More generally, according to the Logstash jdbc input plugin documentation, the scheduling format is "cron-like". The plugin uses the Ruby rufus-scheduler. The docs on that scheduling format are here: https://github.com/jmettraux/rufus-scheduler#parsing-cronlines-and-time-strings
Logstash 6.0 JDBC plugin docs are here: https://www.elastic.co/guide/en/logstash/6.0/plugins-inputs-jdbc.html
Hope this helps.

How to get all the versions of an HBase row

I am trying to do the following command in hbase:
scan 'testLastVersion' {VERSIONS=>8}
And it returns only the last version of the row.
Do you know how I can get all the versions of a row through the command shell and through Java code?
Thanks!
I think you are missing the ',' there. The command should be something like this:
scan 'emp', {VERSIONS=>8}
Also, if the comma really were missing, the HBase shell should throw an error:
SyntaxError: (hbase):16: syntax error, unexpected tLCURLY
I tried to simulate your scenario and got all the versions back. Please find the results below.
hbase(main):010:0> put 'emp', '1', 'personal_data:name', 'Ajay'
0 row(s) in 0.0220 seconds
hbase(main):012:0> put 'emp', '1', 'personal_data:name', 'Vijay'
0 row(s) in 0.0140 seconds
hbase(main):014:0> put 'emp', '1', 'personal_data:name', 'Ceema'
0 row(s) in 0.0070 seconds
hbase(main):017:0> scan 'emp', {VERSIONS=>3}
ROW COLUMN+CELL
1 column=personal_data:name, timestamp=1472651320449, value=Ceema
1 column=personal_data:name, timestamp=1472651313396, value=Vijay
1 column=personal_data:name, timestamp=1472651300718, value=Ajay
1 row(s) in 0.0220 seconds
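For the Java side of the question, here is a minimal sketch, assuming the HBase 1.x client API and the table/column names from the shell example above (not your testLastVersion table):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AllVersions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("emp"))) {
            Get get = new Get(Bytes.toBytes("1"));
            get.setMaxVersions(8); // ask for up to 8 versions of each cell
            Result result = table.get(get);
            // Cells for the same column come back newest first
            for (Cell cell : result.getColumnCells(Bytes.toBytes("personal_data"), Bytes.toBytes("name"))) {
                System.out.println(cell.getTimestamp() + " => " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
Note that both the shell and the Java client can only return as many versions as the column family is configured to keep (its VERSIONS attribute), regardless of how many you ask for.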

Laravel 5 and Carbon discrepancy on Forge

Hopefully I'm not mad and I'm just missing something. I have a project on Laravel 5.0 with a requestExpired function that is called on every incoming request. To calculate the difference between the current time on the server and the timestamp within the request, I'm using:
$now = Carbon::now('UTC');
$postedTime = Carbon::createFromTimestamp($timestamp, 'UTC');
For some reason the request is always rejected as expired. When I debug these two lines from above and just dump the data, I get:
REQUEST'S TIMESTAMP IS: 1423830908279
$NOW OBJECT: Carbon\Carbon Object
(
[date] => 2015-02-13 12:35:08.000000
[timezone_type] => 3
[timezone] => UTC
)
$POSTEDTIME OBJECT: Carbon\Carbon Object
(
[date] => 47089-05-28 09:37:59.000000
[timezone_type] => 3
[timezone] => UTC
)
Any ideas why $postedTime is so wrong? Thanks!
To answer my own question: for some strange reason the webhook calls from the remote API have 13-digit timestamps (milliseconds rather than seconds), and that's why my dates were so wrong.
