Stack, Queue, Tree? - data-structures

13, 5, 2, 9, 8 and 1 are inserted into a data structure in that order.
An item is then deleted using only a basic data structure operation. If the deleted item is 1, which data structure cannot it be?
Priority queue
Stack
Queue
Search tree
Thanks in advance!

It cannot be a queue, because a queue only supports removing the oldest item (FIFO: first in, first out), and 1 was inserted last, so it cannot be the first item removed.
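A quick Python sketch confirms the reasoning: with the same insertion order, a stack and a min-priority queue can both yield 1 as the first deletion, while a queue cannot.

```python
import heapq
from collections import deque

items = [13, 5, 2, 9, 8, 1]

# Stack (LIFO): the last item pushed is the first popped.
stack = list(items)
print(stack.pop())          # 1 -- possible

# Min-priority queue: the smallest item is deleted first.
heap = []
for x in items:
    heapq.heappush(heap, x)
print(heapq.heappop(heap))  # 1 -- possible

# Queue (FIFO): only the oldest item can be removed.
q = deque(items)
print(q.popleft())          # 13, not 1 -- a queue cannot delete 1 first
```

(A search tree supports deleting an arbitrary key, so it can delete 1 as well.)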

Related

Is there any cap on number of tuples generated out of one tuple from a Apache Storm Bolt?

One of our Bolts breaks a large tuple down into its children and then emits those children as tuples. There can sometimes be 10,000 children.
This flood of tuples chokes our topology.
Is there any cap/ceiling value on the number of tuples generated out of one tuple in a Bolt?
We need to send these children further down the topology so that state of these children can be updated according to state of parent.
There is a cap where Storm's algorithm for tracking tuples breaks down, but that cap is at the point where you start to see common collisions between 64-bit random values. So no, there effectively isn't a cap.
What you might run into is that it takes too long to process all the child tuples, so the whole tuple tree hits the tuple timeout. Either you can increase the timeout, or you can detour over e.g. Kafka so the entire processing doesn't have to happen within a single tuple tree's processing time.
A setup like
Topology A: source -> splitter -> Kafka
Topology B: Kafka -> processing
lets you process each child individually instead of having to handle all 10k tuples within the message timeout of the parent.
Please elaborate if you meant something else by your topology being choked.

Isn't the benefit of a B-Tree lost when it is saved to a file?

I was reading about B-Trees, and it was interesting to learn that they are specifically built for storage in secondary memory. But I am a little puzzled by a few points:
If we save the B-Tree to secondary memory (via serialization in Java), isn't the advantage of the B-Tree lost? Once a node is serialized, we no longer have references to its child nodes (as we do in primary memory). That means we would have to read all the nodes one by one (since no child references are available). And if we have to read all the nodes, what is the advantage of the tree? In that case we are not using binary search on the tree. Any thoughts?
When a B-Tree is used on disk, it is not read from a file, deserialized, modified, serialized, and written back as a whole.
A B-Tree on disk is a disk-based data structure consisting of blocks of data, and those blocks are read and written one block at a time. Typically:
Each node in the B-Tree is a block of data (bytes). Blocks have fixed sizes.
Blocks are addressed by their position in the file, if a file is used, or by their sector address if B-Tree blocks are mapped directly to disk sectors.
A "pointer to a child node" is just a number that is the node's block address.
Blocks are large. Typically large enough to have 1000 children or more. That's because reading a block is expensive, but the cost doesn't depend much on the block size. By keeping blocks big enough so that there are only 3 or 4 levels in the whole tree, we minimize the number of reads or writes required to access any specific item.
Caching is usually used so that most accesses only need to touch the lowest level of the tree on disk.
So to find an item in a B-Tree, you would read the root block (it will probably come out of cache), look through it to find the appropriate child block and read that (again probably out of cache), maybe do that again, finally read the appropriate leaf block and extract the data.
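The block-addressing idea above can be sketched in a few lines of Python. This is not a B-Tree implementation, just a demonstration that a "child pointer" can be a plain block number, turned into a file offset by multiplying by the block size (the tiny `BLOCK_SIZE` and the single root/leaf layout are illustrative assumptions):

```python
import struct
import tempfile

BLOCK_SIZE = 64  # real B-Trees use much larger blocks, e.g. 4 KiB or more

def write_block(f, block_no, data):
    # Pad the payload to the fixed block size and write it at its offset.
    assert len(data) <= BLOCK_SIZE
    f.seek(block_no * BLOCK_SIZE)
    f.write(data.ljust(BLOCK_SIZE, b"\x00"))

def read_block(f, block_no):
    # A "child pointer" is just a block number: offset = number * block size.
    f.seek(block_no * BLOCK_SIZE)
    return f.read(BLOCK_SIZE)

# Demo: block 0 acts as the "root" and stores the block number of its child.
with tempfile.TemporaryFile() as f:
    write_block(f, 1, b"leaf data")
    write_block(f, 0, struct.pack("<I", 1))  # root points to block 1
    child_no = struct.unpack("<I", read_block(f, 0)[:4])[0]
    leaf = read_block(f, child_no).rstrip(b"\x00")
    print(leaf)  # b'leaf data'
```

No node is ever deserialized into an object graph; each lookup reads only the fixed-size blocks on the path from the root.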

Difference of consecutive elements in a list using mapreduce

I have a list of numbers and want to compute the difference of consecutive numbers in that list. I'm working on RDDs in Apache Spark.
Example:
Input: [1,2,5,7,8,10,13,17,20,20,21]
Output: [1,3,2,1,2,3,4,3,0,1]
I'm wondering if this is possible using the mapreduce paradigm without duplicating the input RDD.
You can use org.apache.spark.mllib.rdd.RDDFunctions.sliding.
Returns a RDD from grouping items of its parent RDD in fixed size blocks by passing a sliding window over them. The ordering is first based on the partition index and then the ordering of items within each partition. This is similar to sliding in Scala collections, except that it becomes an empty RDD if the window size is greater than the total number of items. It needs to trigger a Spark job if the parent RDD has more than one partitions and the window size is greater than 1.
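The transformation that `sliding` performs can be illustrated with plain Python lists (a sketch of the logic only; on an actual RDD you would call `RDDFunctions.sliding(2)` and then map over the windows):

```python
xs = [1, 2, 5, 7, 8, 10, 13, 17, 20, 20, 21]

# sliding(2) yields consecutive pairs; mapping each pair (a, b) to b - a
# gives the differences of consecutive elements.
windows = [xs[i:i + 2] for i in range(len(xs) - 1)]
diffs = [b - a for a, b in windows]
print(diffs)  # [1, 3, 2, 1, 2, 3, 4, 3, 0, 1]
```

This matches the expected output in the question without duplicating the input.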

What is HBase compaction-queue-size at all?

Does anyone know what the regionserver compaction queue size actually means?
By doc's definition:
9.2.5. hbase.regionserver.compactionQueueSize
Size of the compaction queue. This is the number of stores in the region that have been targeted for compaction.
Is it the number of Stores (or store files? I have heard both versions) in a regionserver that need to be major compacted?
I have a job that writes data in a hotspot style using a sequential key (not distributed),
and in the metric history I saw the compaction-queue-size reach 4 at one point.
That seems theoretically impossible, since I only ever write to one Store (sequential key) at a time.
Then I dug into the logs but found no hint of a queue size > 0:
every major compaction says "This selection was in queue for 0sec":
2013-11-26 12:28:00,778 INFO
[regionserver60020-smallCompactions-1385440028938]
regionserver.HStore: Completed major compaction of 3 file(s) in f1 of
myTable.key.md5....
into md5....(size=607.8 M), total size for
store is 645.8 M. This selection was in queue for 0sec, and took 39sec
to execute.
What confuses me even more: wasn't multi-threaded compaction enabled in an earlier version, with each compaction job allocated to its own thread? If so, why does a compaction queue exist at all?
Too bad there's no detailed explanation in the HBase docs.
I don't fully understand your question. But let me attempt to answer it to the best of my abilities.
First let's talk about some terminology for HBase. Source
Table (HBase table)
Region (Regions for the table)
Store (Store per ColumnFamily for each Region for the table)
MemStore (MemStore for each Store for each Region for the table)
StoreFile (StoreFiles for each Store for each Region for the table)
Block (Blocks within a StoreFile within a Store for each Region for the table)
A Region in HBase is defined as the rows between two row keys. If you have more than one ColumnFamily in your Table, you get one Store per ColumnFamily per Region. Every Store has a MemStore and 0 or more StoreFiles.
StoreFiles are created when the MemStore is flushed. Every so often, a background thread triggers a compaction to keep the number of files in check. There are two types of compactions: major and minor. When a Store is targeted for a minor compaction, it picks up some adjacent StoreFiles and rewrites them as one. A minor compaction does not remove deleted/expired data. If a minor compaction picks up all StoreFiles in a Store, it is promoted to a major compaction. In a major compaction, all StoreFiles of a Store are rewritten as one StoreFile.
Ok... so what is a Compaction Queue?
It is the number of Stores in a RegionServer that have been targeted for compaction. Similarly a Flush Queue is the number of MemStores that are awaiting flush.
As to the question of why there is a queue when you can do it asynchronously, I have no idea. This would be a great question to ask on the HBase mailing list. It tends to have faster response times.
EDIT: The compaction queue exists so that compactions don't take up 100% of a RegionServer's resources.

How to write multiple records on a single Hadoop node

I need help for a Hadoop problem.
In my Java system, I have a function that creates n records. Each record is a row to be written to a text file in Hadoop.
The problem is:
How can I save all n records on the same Hadoop node? In other words, I want the n records to be treated as a single unit, so that if one of these records (or one of its replicas) is on a node, the other n-1 records are on the same node.
For example, suppose that my function creates:
record1: 5 los angeles rainy
record2: 8 new york sunny
record3: 2 boston rainy
When I append these three records (three rows) to the text file in Hadoop, it can happen that record1 goes to node1, record2 to node2 and record3 to node3. I want to know if there is a way to ensure that all three records are stored on the same node, for example node2, rather than on different nodes.
Thank you for your attention.
Hadoop will partition the tuples based on the default HashPartitioner and send the tuples with the same key to a single reducer for aggregation. If the default HashPartitioner doesn't fit the requirement, a custom partitioner can be written. Here is the code for the HashPartitioner in the trunk.
Another way is to emit the keys from the mapper according to your partition strategy, and the HashPartitioner will send all tuples with the same key to one of the reducers.
Also, think at a Map and Reduce level abstraction and not a node level. Hadoop tries to hide the network topology of the cluster.
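As a sketch of the idea (not Hadoop's actual Java code), the HashPartitioner's logic amounts to hashing the key modulo the number of reducers, so emitting one shared key for related records routes them all to the same reducer. The key name below is a hypothetical grouping key:

```python
def hash_partition(key, num_reducers):
    # Mirrors Hadoop's HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (hash(key) & 0x7FFFFFFF) % num_reducers

records = ["5 los angeles rainy", "8 new york sunny", "2 boston rainy"]

# Emit the SAME key for all related records from the mapper...
shared_key = "city-batch-1"  # hypothetical grouping key

# ...and every record lands in the same partition, hence the same reducer.
partitions = {hash_partition(shared_key, 10) for _ in records}
print(len(partitions))  # 1
```

Note that this co-locates the records in one reducer's output file; the HDFS layer then replicates that file's blocks as a unit.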
Alternatively, set your parallelism to one, i.e. specify a single reducer. Then all your records are written into one part file. The downside is that your job takes much longer to complete.
