How to generate RowId automatically in HBase MapReduce program - hadoop

I need to load a dataset file into an HBase table. I googled some examples and, following them, tried reading a file and loading it into HBase, but only the first row is being read. I need to load all of the data, and I don't know where I went wrong.
The file is in this format:
year class days mm
1964 9 20.5 8.8
1964 10 13.6 4.2
1964 11 11.8 4.7
1964 12 7.7 0.1
1965 1 7.3 0.8
1965 2 6.5 0.1
1965 3 10.8 1.4
1965 4 13.2 3.5
1965 5 16.1 7.0
1965 6 19.0 9.2
1965 7 18.7 10.7
1965 8 19.9 10.9
1965 9 16.6 8.2
Can anyone tell me where I went wrong? I need to load all the data contained in the file, but only the first row of data is loaded.

https://github.com/imyousuf/smart-dao/tree/hbase/smart-hbase/hbase-auto-long-rowid-incrementor/ I did not test it, but it seems to be what you're looking for.
Also, have a look at "HBase auto increment any column/row-key".
Monotonically increasing row keys are not recommended in HBase; see http://hbase.apache.org/book/rowkey.design.html, section 6.3.2, for reference. In fact, using globally ordered row keys would cause all instances of your distributed application to write to the same region, which will become a bottleneck.
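If you do need generated or sequential keys, a common workaround is to salt them so that writes spread across regions instead of all hitting one. A minimal sketch of the idea, assuming a fixed bucket count; the helper below is illustrative, not taken from the linked project:

import org.apache.hadoop.hbase.util.Bytes;

// Sketch: prefix the natural key with a small hash-derived salt so that
// otherwise-sequential keys are distributed across regions.
public class SaltedKey {
    private static final int NUM_BUCKETS = 16;  // assumed tuning knob

    public static byte[] salted(String naturalKey) {
        byte salt = (byte) ((naturalKey.hashCode() & 0x7fffffff) % NUM_BUCKETS);
        return Bytes.add(new byte[] { salt }, Bytes.toBytes(naturalKey));
    }
}

The trade-off is that range scans then have to fan out over all buckets, so salting only makes sense when write throughput matters more than ordered reads.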

I guess it's because the row keys of your table are by default taking the value of the first column, which is 'year', so HBase keeps only one row per year since a row key cannot be duplicated; each new line with the same year overwrites the previous one.
Try setting your row key to a different column, or to a combination of columns such as year plus class, as sketched below.
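A minimal sketch of what the mapper could look like with a composite year + class row key, assuming a TableOutputFormat job wired up via TableMapReduceUtil; the column family "cf" and the qualifier names are assumptions, not taken from your code:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: build a unique row key from year + class so every input line
// becomes its own HBase row instead of overwriting "1964", "1965", ...
public class WeatherMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length < 4 || fields[0].equals("year")) {
            return;  // skip the header line and malformed rows
        }

        // Composite row key, e.g. "1964_09"
        String rowKey = fields[0] + "_" + String.format("%02d", Integer.parseInt(fields[1]));

        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("days"), Bytes.toBytes(fields[2]));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("mm"), Bytes.toBytes(fields[3]));
        context.write(new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put);
    }
}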

Related

Hand-coded ETL vs Talend Open Studio

I have developed an ETL with shell scripting.
After that, I found that there is an existing solution, Talend Open Studio, and I'm thinking of using it for my future tasks.
My problem is that the files I want to integrate into the database must be transformed in structure. This is the structure I have:
19-08-02 Name appel ok hope local merge (mk)
juin nov sept oct
00:00:t1 T1 299 0 24 8 3 64
F2 119 0 11 8 3 62
I1 25 0 2 9 4 64
F3 105 0 10 7 3 61
Regulated F2 0 0 0
FR T1 104 0 10 7 3 61
I must transform it into a flat file format.
Does Talend offer the possibility to perform several transformations before integrating data from CSV files into the database, or not?
Edit
This is an example of the flat file I want to achieve before integrating the data into the database (only the first row is concerned):
Timer,T1,F2,I1,F3,Regulated F2,FR T1
00:00:t1,299,119,25,105,0,104
00:00:t2,649,119,225,165,5,102
00:00:t5,800,111,250,105,0,100
We can split the task into three pieces: extract, transform, load.
Extract
First you have to find out how to connect to the source. With Talend it's possible to connect to different kinds of sources, like databases, XML files, flat files, CSV, etc. The components are called tFileInput or tMySQLInput, to name a few.
Transform
Then you have to tell Talend how to split the data into columns. In your example, this could be done on the whitespace, although the splitting might be difficult because the field Name also contains a whitespace.
Afterwards, since it is a column-to-row transposition, you have to write some Java code in a tJavaRow component, or you could alternatively use a tMap component with a conditional mapping such as (row.Name.equals("T1") ? row.value : 0).
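Outside of the component, the transposition itself boils down to collecting one value per name within a block and emitting a single flat row. A minimal standalone sketch of that core logic, with the column names taken from the target layout above and one hard-coded block standing in for the grouped input:

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: turn a name/value block into one comma-separated row,
// mirroring what a tJavaRow/tMap step would do per timer block.
public class Transpose {
    public static void main(String[] args) {
        // Target columns, taken from the desired flat-file header
        String[] columns = {"T1", "F2", "I1", "F3"};

        // One measurement block from the source file: name -> first value
        Map<String, String> block = new LinkedHashMap<String, String>();
        block.put("T1", "299");
        block.put("F2", "119");
        block.put("I1", "25");
        block.put("F3", "105");

        // Emit a single flat row: timer followed by one value per column
        StringBuilder row = new StringBuilder("00:00:t1");
        for (String col : columns) {
            row.append(',').append(block.containsKey(col) ? block.get(col) : "0");
        }
        System.out.println(row);  // 00:00:t1,299,119,25,105
    }
}

In the actual job the block would come from the grouped input rows rather than a hard-coded map.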
Load
Then the transformation would be completed and your data could be stored in a database, target file, etc. Components here would be called tFileOutput or tOracleOutput for example.
Conclusion
Yes, it would be possible to build your ETL process in Talend. The transposition could be a little bit complicated if you are new to Talend. But if you keep in mind that Talend processes data row by row (as your script does, I assume) this is not that big of a problem.

Need suggestions implementing recursive logic in Hive UDF

We have a Hive table that has around 500 million rows. Each row here represents a "version" of the data, and I've been tasked with creating a table which contains just the final version of each row. Unfortunately, each version of the data only contains a link to its previous version. The actual computation for deriving the final version of a row is not trivial, but I believe the example below illustrates the issue.
For example:
id | xreference
----------------
1 | null -- original version of 1
2 | 1 -- update to id 1
3 | 2 -- update to id 2-1
4 | null -- original version of 4
5 | 4 -- update to version 4
6 | null -- original version of 6
When deriving the final version of the row from the above table I would expect the rows with ids 3, 5 and 6 to be produced.
I implemented a solution for this in Cascading which, although correct, has an n^2 runtime and takes half a day to complete.
I also implemented a solution using Giraph, which worked great on small data sets, but I kept running out of memory for larger sets. With this implementation I basically created a vertex for every id and an edge between each id/xreference pair.
We have now been looking into consolidating/simplifying our ETL process, and I've been asked to provide an implementation that can run as a Hive UDF. I know that Oracle provides functions for just this sort of thing, but I haven't found much in the way of Hive functions.
I'm looking for suggestions for implementing this type of recursion specifically with Hive, but I'd like to hear any ideas.
Hadoop 1.3
Hive 0.11
Cascading 2.2
12 node development cluster
20 node production cluster
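Not a Hive UDF, but one way to reframe the example that avoids recursion altogether: a row is "final" exactly when no other row points at it through xreference. A minimal in-memory sketch of that anti-join formulation on the sample data above:

import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Sketch: a version is final when its id never appears in the xreference column.
public class FinalVersions {
    public static void main(String[] args) {
        // id -> xreference (null means original version), taken from the example above
        Map<Integer, Integer> xref = new LinkedHashMap<Integer, Integer>();
        xref.put(1, null);
        xref.put(2, 1);
        xref.put(3, 2);
        xref.put(4, null);
        xref.put(5, 4);
        xref.put(6, null);

        Set<Integer> referenced = new HashSet<Integer>(xref.values());
        for (Integer id : xref.keySet()) {
            if (!referenced.contains(id)) {
                System.out.println(id);  // prints 3, 5, 6
            }
        }
    }
}

In Hive the same idea maps onto a LEFT OUTER JOIN of the table against its own xreference column, keeping the rows whose id is never matched; that sidesteps the recursion as long as only the terminal version of each chain is needed.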

MonetDB; !FATAL: BBPextend: trying to extend BAT pool beyond the limit (16384000)

Our monetdbd instance throws the error "!FATAL: BBPextend: trying to extend BAT pool beyond the limit (16384000)" after restarting from a normal shutdown (monetdbd start farm works, but monetdb start database fails with the given error).
The database contains fewer than 10 tables, each with between 3 and 22 fields. The overall database size is about 16 GB, and one table with 5 fields (3 ints, 1 bigint, 1 date) has 450 million rows.
Does anyone have an idea how to solve this problem without losing the data?
monetdbd --version
MonetDB Database Server v1.7 (Jan2014-SP1)
Server details:
Ubuntu 13.10 (GNU/Linux 3.11.0-19-generic x86_64)
12-core CPU (hexacore + HT): Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz
24 GB Ram
2x 120 GB SSD, Software-Raid 1, LVM
Further details:
# wc BBP.dir: "240 10153 37679 BBP.dir"
That sounds strange. What OS and hardware platform is this?
Are you accidentally using a 32-bit Windows version?

Convert Text File to Sequence File

I am a newbie to Hadoop and Mahout.
I wanted to know how to convert a simple text file containing a set of vectors to a SequenceFile. I tried the MR framework and changed the output format to SequenceFileOutputFormat, and I get the following output:
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text��.�U_v�;�Vs�'�sample0 1 2 3 4 5sample1
6 7 8 9 10sample211 12 13 14 15sample316 17 18 19 20
Those hazy characters are binary, so they can't be read, but my issue is how to get sample0 1 2 3 4 (and similarly the others) into SequenceFile format (binary format).
I believe it can be done by changing the output of the mapper function; however, I am unable to figure out how.
-Thanks for your time.
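One way to get real binary vector values, rather than the Text/Text pairs shown above, is to write Mahout's VectorWritable as the value class. A minimal standalone sketch, assuming the org.apache.mahout.math classes are on the classpath; the output path and the hard-coded vector are placeholders, and the same key/value classes could equally be used from a mapper with SequenceFileOutputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

// Sketch: write each named vector as a Text key / VectorWritable value pair,
// which is the SequenceFile layout most Mahout jobs expect.
public class TextToSeq {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("vectors.seq");  // placeholder output path

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, VectorWritable.class);
        try {
            // e.g. the line "sample0 1 2 3 4 5" becomes key "sample0", value [1,2,3,4,5]
            writer.append(new Text("sample0"),
                    new VectorWritable(new DenseVector(new double[] {1, 2, 3, 4, 5})));
        } finally {
            writer.close();
        }
    }
}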

How can I reduce the data fetch time with Mongo for a bigger data size

We have a collection (name_list) of 30 million 'names'. We are comparing these 30 million records with 4 million 'names', which we are fetching from a txt file.
I am using PHP on a Linux platform. I created an index on the 'names' field. I am using a simple find to compare the MongoDB data with the txt file's data:
$collection->findOne(array('names' => $name_from_txt))
I am comparing them one by one. I know a join is not possible in MongoDB. Is there a better method to compare the data in MongoDB?
The OS and other details are as follows.
OS : Ubuntu
Kernel Version : 3.5.0-23-generic
64 bit
MongoDB shell version: 2.4.5
CPU info - 24
Memory - 64G
Disks 3 - out of which mongo is written to a fusion i/o disk of size 320G
File system on mongo disk - ext4 with noatime as mentioned in mongo doc
ulimit settings for mongo changed to 65000
readahead is 32
numa is disabled with --interleave option
When I use a script to compare this, it takes around 5 minutes to complete. What can be done so that it executes faster and finishes in, say, 1-2 minutes? Can anyone help, please?
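One thing that usually helps here is batching: instead of one findOne() per name, look up a chunk of names at a time with $in, so the 4 million lookups become a few thousand round trips. In PHP that would be find(array('names' => array('$in' => $batch))) over chunks of the txt file. A rough sketch of the same idea with the MongoDB Java driver; the database name, collection name and batch handling are assumptions:

import java.util.List;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

// Sketch: query a batch of names at a time with $in instead of one findOne() per name.
public class BatchLookup {

    static void findBatch(DBCollection names, List<String> batch) {
        BasicDBObject query = new BasicDBObject("names", new BasicDBObject("$in", batch));
        DBCursor cursor = names.find(query);
        try {
            while (cursor.hasNext()) {
                System.out.println(cursor.next().get("names"));  // names present in the collection
            }
        } finally {
            cursor.close();
        }
    }

    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost");
        DBCollection names = client.getDB("test").getCollection("name_list");  // db name assumed
        // read the txt file in chunks of ~1000 names and call findBatch(names, chunk)
        client.close();
    }
}

Batch sizes of a few hundred to a few thousand names per $in query are a reasonable starting point; the existing index on 'names' already covers the per-lookup work, so the gain comes mostly from the avoided round trips.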
