What does a node hour mean in Vertex AI and how do I estimate how many I'll need for a job? - google-cloud-vertex-ai

Much of Vertex AI's pricing is calculated per node hour. What is a node hour and how do I go about estimating how many I'll need for a given job?

A node hour represents the time a virtual machine spends running your prediction job or waiting in a ready state to handle prediction or explanation requests; one node running for one hour consumes one node hour.
The price of a node hour varies across regions and by operation.
You can consume node hours in fractional increments. For example, one node running for 30 minutes costs 0.5 node hours.
There are tables in the pricing documentation that can help you estimate your costs, and you can use the Cloud Pricing Calculator.
You can also use Billing Reports to monitor your usage.
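For a back-of-the-envelope estimate, multiply the number of nodes by the hours each one is deployed (including time spent idle but ready to serve) and by your region's price per node hour. A minimal sketch in Python with a made-up rate; the real per-node-hour prices are in the pricing tables:

# Rough node-hour estimate for a deployed prediction endpoint.
# PRICE_PER_NODE_HOUR is a placeholder; look up the real rate for your
# region and machine type in the Vertex AI pricing documentation.
PRICE_PER_NODE_HOUR = 0.75  # USD, hypothetical

def estimate_cost(num_nodes, hours_deployed):
    # Nodes kept in a ready state still accrue node hours, so count the
    # full time the endpoint is deployed, not just time serving requests.
    node_hours = num_nodes * hours_deployed  # fractional hours count too
    return node_hours * PRICE_PER_NODE_HOUR

# Example: 2 nodes kept deployed for 30 minutes = 1.0 node hour.
print(estimate_cost(num_nodes=2, hours_deployed=0.5))  # -> 0.75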

Related

Recommend an algorithm for fair distributed resource allocation consensus

There are distributed computation nodes and there is a set of computation tasks represented by rows in a database table (a row per task):
A node has no information about other nodes: it can't talk to other nodes and doesn't even know how many other nodes there are
Nodes can be added and removed, and nodes may die and be restarted
A node is connected only to the database
There is no limit on tasks per node
The task pool is not finite; new tasks always arrive
A node takes a task by marking that row with a timestamp, so that other nodes don't consider it until some timeout has passed after that timestamp, in case the node dies before the task is done (this claiming step is sketched below)
The goal is to distribute tasks evenly among nodes. To achieve that I need to define some common algorithm of task acquisition: when a node starts, how many tasks should it take?
If a node takes all available tasks, then one node is always busy and the others are idle, so that's not an option.
A reasonable approach would be for each node to take tasks one by one with some delay: each node periodically checks whether there are free tasks and takes only one. Shortly after start-up, all tasks end up more or less evenly distributed across the nodes. The drawback is that, because of the delay, it takes a while before the last task is picked up (say there are 10000 tasks, 10 nodes, and a 1-second delay: it would take 10000 tasks * 1 second / 10 nodes = 1000 seconds from start until all tasks are taken). The distribution is also non-deterministic, so skew is possible.
Question: what kind/class of algorithms solves this problem, distributing tasks quickly and evenly using some sync point (the database in this case), without electing a leader?
For example: nodes use some table to announce which tasks they want to take, then after some coordination steps they reach consensus and start processing, etc.
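For reference, a minimal sketch of the timestamp-based claiming described in the question, assuming a hypothetical SQLite tasks table with id and claimed_at columns (any database with atomic conditional updates works the same way; UPDATE ... RETURNING needs SQLite 3.35+):

import sqlite3
import time

CLAIM_TIMEOUT = 300  # seconds before another node may re-claim a task

def claim_one_task(conn):
    # Atomically claim one free (or timed-out) task and return its id,
    # or None if nothing is available right now.
    now = time.time()
    cur = conn.execute(
        "UPDATE tasks SET claimed_at = ? "
        "WHERE id = (SELECT id FROM tasks "
        "            WHERE claimed_at IS NULL OR claimed_at < ? LIMIT 1) "
        "RETURNING id",
        (now, now - CLAIM_TIMEOUT),
    )
    row = cur.fetchone()
    conn.commit()
    return row[0] if row else None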
So this comes down to a few factors to consider.
How many tasks are currently available overall?
How many tasks are currently accepted overall?
How many tasks has the node accepted in the last X minutes?
How many tasks has the node completed in the last X minutes?
Can the row fields be modified (i.e., can a field be added)?
Can a node request more tasks after it has finished its current tasks, or must all tasks be immediately distributed?
My inclination is do the following:
If practical, add a "node identifier" field (a UUID) to the table with the rows. When a node starts, it generates a UUID node identifier. When it accepts a task it adds a timestamp and its UUID. This easily lets other nodes determine how many "active" nodes there are.
To determine its allocation, the node determines how many tasks are available/accepted, then notes how many unique node identifiers (including itself) have accepted tasks. It then uses this formula to decide how many more tasks to accept (ideally picked at random to minimize competition with other nodes): 2 * available_tasks / active_nodes - node_accepted_tasks. So if there are 100 available tasks, 10 active nodes, and this node has already accepted 5 tasks, it would accept 2 * 100 / 10 - 5 = 15 tasks. If nodes only look for more tasks when they no longer have any, you can just use available_tasks / active_nodes.
To avoid issues, there should be a max number of tasks that a node will accept at once.
If a node identifier is impractical, I would just say that each node should aim to take ceil(sqrt(N)) random tasks, where N is the number of available tasks. If there are 100 tasks, the first node will take 10, the second will take 10, the 3rd will take 9, the 4th will take 9, the 5th will take 8, and so on. This won't distribute all the tasks at once, but it will ensure the nodes get a roughly even number of tasks. The slight staggering of the number of tasks means that the nodes will not all finish their tasks at the same time (which admittedly may or may not be desirable). By not fully distributing them (unless there are sqrt(N) nodes), it also reduces the likelihood of conflicts (especially if tasks are randomly selected), and it reduces the number of "failed" tasks if a node goes down.
This of course assumes that a node can request more tasks after it has started, if not, it makes it much more tricky.
As for an additional table, you could use it to keep track of the current status of the nodes: each node records how many tasks it has, its UUID, and when it last completed a task. That may cause database churn, though. I think it's probably good enough to just record which node has accepted each task along with when it accepted it. This is again more useful if nodes can request tasks in the future.
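A minimal sketch of the two acquisition rules above (the function names and the max-batch cap are made up; the counts would come from the same tasks table):

import math

def tasks_to_take_with_node_ids(available_tasks, active_nodes,
                                already_accepted, max_batch=50):
    # Fair-share rule when rows record which node claimed them: aim for
    # twice the per-node share, minus what this node already holds.
    share = 2 * available_tasks // max(active_nodes, 1)
    return max(0, min(share - already_accepted, max_batch, available_tasks))

def tasks_to_take_without_node_ids(available_tasks):
    # Fallback rule when nodes can't see each other: take ceil(sqrt(N)) of
    # the N available tasks, picked at random to reduce collisions.
    return math.ceil(math.sqrt(available_tasks)) if available_tasks > 0 else 0

# 100 available tasks, 10 active nodes, 5 tasks already held -> 15 more.
print(tasks_to_take_with_node_ids(100, 10, 5))   # -> 15
print(tasks_to_take_without_node_ids(100))       # -> 10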

Time schedule for smart house algorithm

Imagine we have a smart house and want to power up devices in a way that spends less money on electricity.
For every device we know how many hours it should work (continuously) and how much energy it consumes. We assume that each device completes one continuous cycle every day.
Moreover, we have the maximum amount of electricity that can be consumed in any hour (as the sum of the electricity consumed by every device powered up at that moment).
Finally, we have the cost of electricity for every hour.
What is the best algorithm for minimizing the money spent on electricity?
I would like to hear any ideas.
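The stated constraints translate directly into code. A brute-force sketch under the assumptions above (one continuous run per device per day, an hourly consumption cap, a known price for every hour); all the concrete numbers are made up:

from itertools import product

HOURS = 24
prices = [0.10] * 7 + [0.25] * 16 + [0.10]  # hypothetical price per kWh, per hour
hourly_cap = 5.0                            # max total kW drawn in any hour
devices = [(3, 2.0), (2, 1.5)]              # (hours of continuous run, kW drawn)

def schedule_cost(starts):
    # Total cost of running each device from its start hour, or None if
    # the schedule breaks the hourly consumption cap.
    load = [0.0] * HOURS
    cost = 0.0
    for (duration, kw), start in zip(devices, starts):
        for h in range(start, start + duration):
            load[h] += kw
            cost += kw * prices[h]
    return None if any(l > hourly_cap for l in load) else cost

# Exhaustive search over start hours; fine for a handful of devices, but a
# real solver (e.g. an ILP) would be needed for many devices.
best = min((c, s) for s in product(*[range(HOURS - d + 1) for d, _ in devices])
           if (c := schedule_cost(s)) is not None)
print(best)  # (lowest cost, start hour of each device)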

How to get the lowest cost when arranging movable jobs

It's a very hard dynamic programming question, and I want to share it with you so we can discuss its solution a little:
You are deploying your new application to a cloud server and have to schedule your jobs to get the lowest cost. You don't need to care about the number of jobs running at the same time on the same server. Every job k is given by a release time sk, a deadline fk, and a duration dk with dk ≤ fk - sk. This job needs to be scheduled for an interval of dk consecutive minutes between time sk and fk. The server company charges per minute per server. You only need one virtual server, and you can save money by moving jobs around between sk and fk to maximize the amount of time without running any jobs or, in other words, to minimize the amount of time running one or more jobs. Use dynamic programming to solve the problem; your algorithm should be polynomial in n, the number of jobs.
This is the problem of minimizing busy time.
See Theorem 17 of the following paper for a description of a polynomial-time algorithm:
Rohit Khandekar, Baruch Schieber, Hadas Shachnai, and Tami Tamir. Minimizing busy time in multiple machine real-time scheduling. In Proceedings of the 30th Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pages 169-180, 2010.
The key is:
To realize there are only certain interesting times that need to be considered (if you have a schedule, consider delaying each busy interval until you hit a deadline for one of the jobs being processed)
To consider when the longest-duration job is done. This splits the problem into two pieces, before and after, which can be solved independently in the usual dynamic programming fashion.
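As an illustration of the first observation, a hedged brute-force check for small inputs: some optimal schedule finishes every job at one of the deadlines, so it suffices to try only those finish times (exponential, for sanity checking only; the polynomial algorithm is the one in the paper):

from itertools import product

def busy_time(intervals):
    # Length of the union of [start, end) intervals.
    total, cur_start, cur_end = 0, None, None
    for s, e in sorted(intervals):
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def min_busy_time_bruteforce(jobs):
    # jobs: list of (s_k, f_k, d_k). Restrict each job's finish time to the
    # deadlines that fit inside its own window.
    deadlines = sorted({f for _, f, _ in jobs})
    candidates = [[(f - d, f) for f in deadlines if s + d <= f <= fk]
                  for s, fk, d in jobs]
    return min(busy_time(choice) for choice in product(*candidates))

# Two overlapping jobs that can share one busy interval of length 3.
print(min_busy_time_bruteforce([(0, 10, 3), (0, 10, 2)]))  # -> 3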

Extract properties of a Hadoop job

Given a large data file and a jar file containing the mapper and reducer classes, I want to be able to know how big a Hadoop cluster should be (I mean how many machines I would need to form a cluster for the given job to run efficiently).
I am running the job on the given data file(s).
Assuming your MapReduce job scales linearly, I suggest the following test to get a general idea of what you'll need. I assume you have a time in mind when you say "run efficiently"... this might be 1 minute for one person or 1 hour for another... it's up to you.
Run the job on one node on a subset of your data that fits on one node... or, preferably, on a small number of nodes. This test cluster should be representative of the type of hardware you will purchase later.
[(time job took on your test cluster) x (number of nodes in test cluster)]
x [(size of full data set) / (size of sample data set)]
/ (new time, i.e., "run efficiently")
= (number of nodes in final cluster)
Some things to note:
If you double the "time job took on test cluster", you'll need twice as many nodes.
If you halve the "new time", i.e., you want your job to run twice as fast, you'll need twice as many nodes.
The ratio of the sample to the full data set tells you how much to scale the result up.
An example:
I have a job that takes 30 minutes on two nodes. I am running this job over a 4 GB sample of a 400 GB data set. I would like my job to take 12 minutes.
(30 minutes x 2 nodes) x (400 GB / 4 GB) / 12 minutes = 500 nodes
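The same estimate in code, as a quick sanity check of the arithmetic (linear scaling assumed, as above):

def estimate_cluster_size(test_minutes, test_nodes,
                          full_data_gb, sample_data_gb, target_minutes):
    # Node-minutes on the test cluster, scaled up by the data ratio,
    # divided by the target run time.
    node_minutes = test_minutes * test_nodes
    return node_minutes * (full_data_gb / sample_data_gb) / target_minutes

# 30 minutes on 2 nodes over a 4 GB sample of a 400 GB data set,
# targeting a 12-minute run:
print(estimate_cluster_size(30, 2, 400, 4, 12))  # -> 500.0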
This is imperfect in a number of ways:
With one or two nodes, I'm not fully taking into account how long it'll take to transfer data over the network... a major part of a MapReduce job. So you can assume it'll take longer than this estimate. If you can, test your job over 4-10 nodes and scale from there.
Hadoop doesn't "scale down" well. There is a certain speed limit that you won't be able to cross with MapReduce, somewhere around 2-3 minutes on most clusters I've seen. That is, you won't make a MapReduce job run in 3 seconds by having a million nodes.
Your job might not scale linearly, in which case this exercise is flawed.
Maybe you can't find representative hardware, in which case you'll have to factor in how much faster you think your new system will be.
In summary, there is no super accurate way of doing this. The best you can really do right now is experimentation and extrapolation. The more nodes you can run a test on, the better, as the extrapolation will be more accurate.
In my experience, when testing from something like 200 nodes to 800 nodes, this metric is pretty accurate. I'd be nervous about going from 1 node or 2 nodes to 800. But 20 nodes to 800 might be OK.

Google Transit is too idealistic. How would you change that?

Suppose you want to get from point A to point B. You use Google Transit directions, and it tells you:
Route 1:
1. Wait 5 minutes
2. Walk from point A to Bus stop 1 for 8 minutes
3. Take bus 69 till stop 2 (15 minutes)
4. Wait 2 minutes
5. Take bus 6969 till stop 3 (12 minutes)
6. Walk from stop 3 to point B (3 minutes)
Total time = 5 minutes wait + 40 minutes.
Route 2:
1. Wait 10 minutes
2. Walk from point A to Bus stop I for 13 minutes
3. Take bus 96 till stop II (10 minutes)
4. Wait 17 minutes
5. Take bus 9696 till stop 3 (12 minutes)
6. Walk from stop 3 to point B (8 minutes)
Total time = 10 minutes wait + 60 minutes.
All in all, Route 1 looks way better. However, what really happens in practice is that bus 69 is 3 minutes behind due to traffic, and I end up missing bus 6969. The next bus 6969 comes at least 30 minutes later, which amounts to 5 minutes wait + 70 minutes (including a 30-minute wait in the cold or heat). Wouldn't it be nice if Google actually advertised this possibility? My question now is: what is a better algorithm for displaying the top 3 routes, given uncertainty in the schedule?
Thanks!
How about adding weightings that express a level of uncertainty for different types of journey elements?
Bus services in Dublin City are notoriously untimely, so you could add a 40% margin of error to anything involving the Dublin Bus schedule, giving a best and worst case scenario. You could also factor in the chronic traffic delays at rush hour. Then a user could see that they may have a 20% or 80% chance of actually making a connection.
You could sort "best" journeys by the "most probably correct" factor, and include this data in the results shown to the user.
My two cents :)
For the UK rail system, each interchange node has an associated 'minimum transfer time to allow'. The interface to the route planner here has an Advanced option allowing the user to either accept the default or add half-hour increments.
In your example, setting a 'minimum transfer time to allow' of, say, 10 minutes at stop 2 would prevent Route 1 as shown from being suggested. Of course, this means that the minimum possible journey time is increased, but that's the trade-off.
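A minimal sketch of that kind of filter, assuming each itinerary is represented simply as a list of (departure minute, arrival minute) pairs per vehicle leg (a made-up representation for illustration):

def has_safe_transfers(legs, min_transfer_minutes=10):
    # Reject an itinerary if any connection has less slack than the
    # user's chosen minimum transfer time.
    for (_, arrive), (depart_next, _) in zip(legs, legs[1:]):
        if depart_next - arrive < min_transfer_minutes:
            return False
    return True

# Route 1 above: bus 69 arrives and bus 6969 leaves only 2 minutes later,
# so with a 10-minute minimum it is filtered out.
route1 = [(13, 28), (30, 42)]
print(has_safe_transfers(route1, min_transfer_minutes=10))  # -> False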
If you take uncertainty into account, then there is no longer a "best route"; instead there can be a "best strategy" that minimizes the total time in transit. However, it can't be represented as a linear sequence of instructions but is more of the form of a general plan, i.e. "go to bus station X, wait until 10:00 for bus Y, if it does not arrive walk to station Z...". This would be notoriously difficult to present to the user (in addition to being computationally expensive to produce).
For a fixed sequence of instructions it is possible to calculate the probability that it actually works out, but what level of certainty would users want to accept? Would you be content with, say, an 80% success rate? And when you do miss one of your connections, the house of cards falls down in the worst case, e.g. if you miss a train that leaves every second hour.
Many years ago I wrote a similar program to calculate long-distance bus journeys in Finland, and I just reported the transfer times, assuming every bus was on schedule. Then basically every plan with less than about 15 minutes of transfer time was disregarded because it was too risky (there were sometimes only one or two long-distance buses per day on a given route).
Empirically. Record the actual arrival times vs scheduled arrival times, and compute the mean and standard deviation for each. When considering possible routes, calculate the probability that a given leg will arrive late enough to make you miss the next leg, and make the average wait time P(on time)*T(first bus) + (1-P(on time))*T(second bus). This gets more complicated if you have to consider multiple legs, each of which could be late independently, and multiple possible next legs you could miss, but the general principle holds.
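A sketch of that calculation for a single connection, assuming the incoming leg's delay is roughly normally distributed with the recorded mean and standard deviation (the numbers in the example are made up):

from math import erf, sqrt

def p_make_connection(slack_minutes, mean_delay, std_delay):
    # Probability that the incoming leg's delay is small enough to leave
    # time for the scheduled connection (normal CDF at the slack).
    z = (slack_minutes - mean_delay) / (std_delay * sqrt(2))
    return 0.5 * (1 + erf(z))

def expected_wait(slack_minutes, mean_delay, std_delay, headway_minutes):
    # P(on time) * wait for the planned bus + P(late) * wait for the next one.
    p = p_make_connection(slack_minutes, mean_delay, std_delay)
    return p * slack_minutes + (1 - p) * (slack_minutes + headway_minutes)

# 2 minutes of slack, buses typically 3 +/- 2 minutes late, next bus in 30 minutes:
print(expected_wait(2, mean_delay=3, std_delay=2, headway_minutes=30))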
Catastrophic failure should be the first check.
This is especially important when you are trying to connect to the last bus of the day, which is a critical part of the route. The rider needs to know that this is what is happening so he doesn't get too distracted and knows the risk.
After that it could evaluate worst-case single misses.
And then, if you really wanna get fancy, take a look at the crime stats for the neighborhood or transit station where the waiting point is.
