Is there a feasible and easy way to use a local folder as a Hadoop HDFS folder?

I have a massive number of files on an extremely fast SAN disk that I'd like to run Hive queries on.
An obvious option is to copy all files into HDFS by using a command like this:
hadoop dfs -copyFromLocal /path/to/file/on/filesystem /path/to/input/on/hdfs
However, I don't want to create a second copy of my files just to be able to run Hive queries on them.
Is there any way to point an HDFS folder at a local folder, such that Hadoop sees it as an actual HDFS folder? Files keep being added to the SAN disk, so Hadoop needs to see the new files as they are added.
This is similar to Azure HDInsight's approach, where you copy your files into blob storage and HDInsight's Hadoop sees them through HDFS.

For playing around with small files, using the local file system might be fine, but I wouldn't do it for any other purpose.
Putting a file in HDFS means that it is split into blocks, which are replicated and distributed.
This is what later gives you both performance and availability.
Locations of [external] tables can be pointed at the local file system using file:///.
Whether it works smoothly or you'll start getting all kinds of errors, that remains to be seen.
Please note that for this demo I'm using a little trick to point the location at a specific file, but your basic use case will probably be a directory.
Demo
create external table etc_passwd
(
Username string
,Password string
,User_ID int
,Group_ID int
,User_ID_Info string
,Home_directory string
,shell_command string
)
row format delimited
fields terminated by ':'
stored as textfile
location 'file:///etc'
;
alter table etc_passwd set location 'file:///etc/passwd'
;
select * from etc_passwd limit 10
;
+----------+----------+---------+----------+--------------+-----------------+----------------+
| username | password | user_id | group_id | user_id_info | home_directory | shell_command |
+----------+----------+---------+----------+--------------+-----------------+----------------+
| root | x | 0 | 0 | root | /root | /bin/bash |
| bin | x | 1 | 1 | bin | /bin | /sbin/nologin |
| daemon | x | 2 | 2 | daemon | /sbin | /sbin/nologin |
| adm | x | 3 | 4 | adm | /var/adm | /sbin/nologin |
| lp | x | 4 | 7 | lp | /var/spool/lpd | /sbin/nologin |
| sync | x | 5 | 0 | sync | /sbin | /bin/sync |
| shutdown | x | 6 | 0 | shutdown | /sbin | /sbin/shutdown |
| halt | x | 7 | 0 | halt | /sbin | /sbin/halt |
| mail | x | 8 | 12 | mail | /var/spool/mail | /sbin/nologin |
| uucp | x | 10 | 14 | uucp | /var/spool/uucp | /sbin/nologin |
+----------+----------+---------+----------+--------------+-----------------+----------------+

You can mount your HDFS path onto a local folder, for example with an HDFS NFS mount.
See the HDFS NFS Gateway documentation for more info.
But if you want speed, it isn't an option.
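For completeness, here is a minimal sketch of that mount using the HDFS NFS Gateway; it assumes the gateway is already configured in core-site.xml/hdfs-site.xml, and the gateway host and mount point below are placeholders:
# Start the gateway services (stop the system's own portmap/rpcbind first):
hdfs portmap
hdfs nfs3
# Mount the HDFS root onto a local folder over NFSv3:
mount -t nfs -o vers=3,proto=tcp,nolock nfs-gateway-host:/ /mnt/hdfs
Note that this goes in the opposite direction of the question: it exposes HDFS as a local folder rather than a local folder as HDFS.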

Related

Can't save files using FTP on GeneXus

I am trying to save a file into a library on an iSeries machine using GxFtpPut on GeneXus 10 V3 with .NET, but when sending the file GeneXus tries to send it to a Windows directory instead of to the library, which works when using the ftp command from the cmd.
I've already tried changing the route it uses, to no avail, and tried to find another way of sending the file through GeneXus.
For example, when using the cmd I just run this:
put C:\FILES\Filename.txt Library/Filename
and it sends the file into the library.
But when doing this on GeneXus:
Call("GxFtpPut", &FileDirectory , 'Library/'+&FileName,'B' )
it does not work and tries to find a directory with that name among the Windows files of the server.
I just want to be able to send the file to the library on the server without issue.
IBM i has two distinct name formats depending on the file system you are trying to use. NAMEFMT 0 is the library/filename format, and is likely unknown to PC FTP clients. NAMEFMT 1 is the typical hierarchical directory path used by non-IBM i computers, and also works with IBM i if you want to put a file anywhere in the IFS (Integrated File System).
Fun fact: the native library file system is also accessible from the IFS, but to address it you need to use a format that might be a little unfamiliar: /QSYS.lib/library.lib/filename.file/membername.mbr. You may be able to drop the member name.
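For example, a put using NAMEFMT 1 and that IFS path format would look like this (library, file, and member names here are hypothetical):
put test.txt /QSYS.LIB/MYLIB.LIB/TESTPF.FILE/TESTPF.MBR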
To change name format, you can issue the SITE sub-command on your remote host like this:
QUOTE SITE NAMEFMT 0 -- This sets name format 0 (library/filename)
QUOTE SITE NAMEFMT 1 -- This sets name format 1 (directory path)
I did some testing with a plain Windows FTP client. The test file on the PC was a text file created in Notepad++. It turns out that we start out in NAMEFMT 0 unless it is changed. It looks like GeneXus only supports a limited set of commands, so here is the limited FTP script that works:
ascii
put test.txt mylib/testpf
I can now pull up testpf on the greenscreen utilities and read it. I can also read testpf in my GUI SQL client. The ASCII text has been converted properly to EBCDIC.
|TESTPF |
|--------------------------------------------------------------------------------|
| |
|// ------------------------------------ |
|// Sweep |
|// |
|// Performs the sweep logic |
|// ------------------------------------ |
|dcl-proc Sweep; |
| |
| |
| exec sql |
| update atty a |
| set ymglsb = (select ymglsb from glaty |
| where atty = a.atty) |
| where atty in (select atty from glaty where atty = a.atty); |
|// where ymglsb in (select ymglsb from glaty where atty = a.atty); |
| if %subst(sqlstate: 1: 2) < '00' or |
| %subst(sqlstate: 1: 2) > '02'; |
| exec sql get diagnostics condition 1 |
| :message = message_text; |
| SendSqlMsg('02: ' + message); |
| endif; |
| |
| exec sql |
| update atty a |
| set ymglsb = '000' |
| where not exists (select * from glaty where atty = a.atty); |
| if %subst(sqlstate: 1: 2) < '00' or |
| %subst(sqlstate: 1: 2) > '02'; |
| exec sql get diagnostics condition 1 |
| :message = message_text; |
| SendSqlMsg('03: ' + message); |
| endif; |
| |
|end-proc; |
However, if I try to transfer in binary mode, the resulting data in the file looks like this:
|TESTPF |
|--------------------------------------------------------------------------------|
|ëÏÁÁø&ÁÊÃ?Ê_ËÈÇÁËÏÁÁø% |
|?ÅÑÄÀÄ%øÊ?ÄëÏÁÁøÁÌÁÄËÉ% |
|ÍøÀ/ÈÁ/ÈÈ`/ËÁÈ`_Å%ËÂËÁ%ÁÄÈ`_Å%ËÂÃÊ?_Å%/È` |
|ÏÇÁÊÁ/ÈÈ`//ÈÈ`ÏÇÁÊÁ/ÈÈ`Ñ>ËÁ%ÁÄÈ/ÈÈ`ÃÊ?_Å%/È`ÏÇÁÊÁ/ÈÈ |
|`//ÈÈ`ÏÇÁÊÁ`_Å%ËÂÑ>ËÁ%ÁÄÈ`_Å%ËÂÃÊ?_Å%/È`ÏÇÁÊÁ/ÈÈ`//ÈÈ |
|`ÑöËÍÂËÈËÉ%ËÈ/ÈÁ?ʶËÍÂËÈËÉ%ËÈ/ÈÁ |
|ÁÌÁÄËÉ%ÅÁÈÀÑ/Å>?ËÈÑÄËÄ?>ÀÑÈÑ?>_ÁËË/ÅÁ_ÁËË/ÅÁ¬ÈÁÌÈ |
|ëÁ>ÀëÉ%(ËÅ_ÁËË/ÅÁÁ>ÀÑÃÁÌÁÄËÉ%ÍøÀ/ÈÁ/ÈÈ`/ |
|ËÁÈ`_Å%ËÂÏÇÁÊÁ>?ÈÁÌÑËÈËËÁ%ÁÄÈÃÊ?_Å%/È`ÏÇÁÊÁ/ÈÈ`// |
|ÈÈ`ÑöËÍÂËÈËÉ%ËÈ/ÈÁ?ʶËÍÂËÈËÉ%ËÈ/ÈÁ |
|ÁÌÁÄËÉ%ÅÁÈÀÑ/Å>?ËÈÑÄËÄ?>ÀÑÈÑ?>_ÁËË/ÅÁ_ÁËË/ÅÁ¬ÈÁÌÈ |
|ëÁ>ÀëÉ%(ËÅ_ÁËË/ÅÁÁ>ÀÑÃÁ>ÀøÊ?Ä |
This has not been converted, because we told the IBM i FTP server not to convert to EBCDIC, since the transfer is binary.
So try ASCII mode and use the library/filename format. The target file does not need to pre-exist.

Efficient way to join by levenshtein in Hive or Impala

I have two tables: one with about 17K records (NLIST) and the other with 57K (FNAMES).
I would like to join the two by comparing the records using the Levenshtein distance.
Here is example content for the tables:
Table NLIST:
+------+-------------+
| ID | S_NAME |
+------+-------------+
| 1 | Avi |
| 2 | Moshe |
| 3 | David |
....
Table FNAMES:
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 2 | Dudi |
| 3 | Moshiko |
| 4 | Avi |
| 5 | DAVE |
....
The above tables are just examples. In the real case the names column can include more than one word.
The required result should be:
+------+-------------+--------+
| ID | NICKNAMES | S_NAME |
+------+-------------+--------+
| 1 | Avile | Avi |
| 2 | Dudi | David |
| 3 | Moshiko | Moshe |
| 4 | Avi | Avi |
| 5 | DAVE | David |
...
Here is the code I use:
select FNAMES.NICKNAMES, NLIST.S_NAME
from FNAMES
LEFT OUTER JOIN NLIST
ON (true)
WHERE levenshtein (FNAMES.NICKNAMES, NLIST.S_NAME) <= 4
The above code runs for a very long time, and I stopped it.
How can I make it run in a reasonable time?
In addition, I think the Levenshtein distance depends on the length of the words. How can I find the optimal value for the distance threshold (in this case I chose 4 arbitrarily)?
Hive query performance depends on various factors:
Query engine
File format
Vectorization: set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true;
If you have a good server you can try Impala; it is definitely faster than Hive.
You can also do fine tuning of Impala, which will give you an edge in executing this query faster: Tuning Impala for Performance
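That said, the dominant cost in the posted query is the unconstrained cross join (ON (true)), which computes levenshtein() for every one of the roughly 17K x 57K (about one billion) pairs. A common fix is to add a cheap equality "blocking" key to the join so only plausible candidates are compared. Below is a sketch, assuming Hive 1.2+ (where levenshtein() and greatest() are built in); it blocks on the first letter and scales the distance cutoff with name length instead of using a fixed 4. Table and column names follow the question; the 0.6 factor is an assumption to tune:
select id, nicknames, s_name
from (
    select f.ID as id, f.NICKNAMES as nicknames, n.S_NAME as s_name,
           -- keep only the closest S_NAME for each nickname
           row_number() over (
               partition by f.ID
               order by levenshtein(upper(f.NICKNAMES), upper(n.S_NAME))
           ) as rnk
    from FNAMES f
    join NLIST n
      -- blocking key: only compare names that share a first letter
      on upper(substr(f.NICKNAMES, 1, 1)) = upper(substr(n.S_NAME, 1, 1))
    where levenshtein(upper(f.NICKNAMES), upper(n.S_NAME))
          <= greatest(length(f.NICKNAMES), length(n.S_NAME)) * 0.6
) t
where rnk = 1;
Blocking on the first letter preserves every match in the example data while cutting the number of distance computations by roughly the number of distinct first letters (use a left join instead if nicknames without any match must be kept). The length-relative cutoff addresses the second question: longer names tolerate more edits than short ones, so a fixed 4 is too loose for Avi and too tight for long multi-word names; 0.6 is the loosest factor that still keeps the Dudi/David pair (distance 3 over length 5) from the example.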

How to move a whole partition to another table in another database?

Database: Oracle 12c
I want to take a single partition, or a set of partitions, disconnect it from a table, or set of tables, on DB1 and move it to another table in another database. I would like to avoid DML for performance reasons (it needs to be fast).
Each Partition will contain between three and four hundred million records.
Each Partition will be broken up into approximately 300 Sub-Partitions.
The task will need to be automated.
Some thoughts I had:
Somehow put each partition in its own datafile upon creation, then detach it from the source and attach it to the destination?
Extract the whole partition (not record by record)?
Any other non-DML solutions are also welcome.
Example (move Part#33 from both tables to DB#2, preferably with a single operation):
__________________ __________________
| DB#1 | | DB#2 |
|------------------| |------------------|
|Table1 | |Table1 |
| Part#1 | | Part#1 |
| ... | | ... |
| Part#33 | ----> | Part#32 |
| Subpart#1 | | |
| ... | | |
| Subpart#300 | | |
|------------------| |------------------|
|Table2 | |Table2 |
| Part#1 | | Part#1 |
| ... | | ... |
| Part#33 | ----> | Part#32 |
| Subpart#1 | | |
| ... | | |
| Subpart#300 | | |
|__________________| |__________________|
Please read the document below; it covers exchanging table partitions, with examples.
https://oracle-base.com/articles/misc/partitioning-an-existing-table-using-exchange-partition
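In outline, the non-DML route is: exchange the partition with an empty staging table (a data-dictionary swap of segments, so its speed does not depend on row count), move the staging table's tablespace to DB#2 with transportable tablespaces, then exchange it into the target table there. A sketch with hypothetical object names; note that for a composite-partitioned table like yours, the staging table must itself be partitioned to match the 300 subpartitions, which the simplified sketch below glosses over:
-- On DB#1: create a matching staging table (FOR EXCHANGE is 12.2+; on 12.1
-- spell out an exactly matching structure) and swap in Part#33's segments.
CREATE TABLE table1_stage FOR EXCHANGE WITH TABLE table1;
ALTER TABLE table1
  EXCHANGE PARTITION part_33 WITH TABLE table1_stage
  INCLUDING INDEXES WITHOUT VALIDATION;
-- Ship table1_stage to DB#2 (e.g. Data Pump transportable tablespaces, if the
-- staging table sits in its own tablespace), then on DB#2 swap it in:
ALTER TABLE table1
  EXCHANGE PARTITION part_33 WITH TABLE table1_stage
  INCLUDING INDEXES WITHOUT VALIDATION;
The only bulk data movement is copying datafiles between servers, which is what makes this fast compared to any DML-based approach.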

Moving dev site to production on a new AWS account

I am in the process of moving and testing a development site on the actual domain name now and I just wanted to check if I was missing anything and also get some advice.
It is a Magento 1.8.1 install from Turnkey Linux running on an m1.medium instance.
What I have done (so far) is: create an image of the development instance, make a new account, and copy the image over to it. I then made an Elastic IP and associated it with the new instance. Next I pointed the A record of the production domain to the Elastic IP.
Now, if I go to the production domain I get redirected to the development domain. Is there a reason for this?
Ideally I would like to have two instances: a dev one that is off unless needed, and of course the production one, which is going to be live 24/7. However, if I turn the development instance off it stops the other too.
I have a feeling it's just because I need to change references to the dev domain in the Magento database / back-end, but I wanted to get a more knowledgeable answer as I don't want to break either of the instances.
Also, I should probably mention that the development domain is a subdomain i.e. shop.mysite.com and the live one is just normal i.e. mysite.com. Not entirely sure this is relevant but thought it worth a mention.
Thanks in advance for any help.
The reason your URL on your new instance is getting redirected to the old URL is that, in the core_config_data table of your Magento database, the web/unsecure/base_url and web/secure/base_url paths point to your old URL.
So if you are using mysql you can query your database as follows:
mysql> use magento;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select * from core_config_data;
+-----------+---------+----------+-------------------------------+-------------------------------------+
| config_id | scope | scope_id | path | value |
+-----------+---------+----------+-------------------------------+-------------------------------------+
| 1 | default | 0 | web/seo/use_rewrites | 1 |
| 2 | default | 0 | admin/dashboard/enable_charts | 0 |
| 3 | default | 0 | web/unsecure/base_url | http://magento.myolddomain.com/ |
| 4 | default | 0 | web/secure/use_in_frontend | 1 |
| 5 | default | 0 | web/secure/base_url | https://magento.myolddomain.com/ |
| 6 | default | 0 | web/secure/use_in_adminhtml | 1 |
| 7 | default | 0 | general/locale/code | en_US |
| 8 | default | 0 | general/locale/timezone | Europe/London |
| 9 | default | 0 | currency/options/base | USD |
| 10 | default | 0 | currency/options/default | USD |
| 11 | default | 0 | currency/options/allow | USD |
| 12 | default | 0 | general/region/display_all | 1 |
| 13 | default | 0 | general/region/state_required | AT,CA,CH,DE,EE,ES,FI,FR,LT,LV,RO,US |
| 14 | default | 0 | catalog/category/root_id | 2 |
+-----------+---------+----------+-------------------------------+-------------------------------------+
14 rows in set (0.00 sec)
and you can change it as follows:
mysql> update core_config_data set value='http://magento.mynewdomain.com' where path='web/unsecure/base_url';
mysql> update core_config_data set value='https://magento.mynewdomain.com' where path='web/secure/base_url';
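Magento caches configuration values, so the updated URLs may not take effect until the cache is cleared; a minimal sketch, assuming a standard Magento 1.x directory layout:
cd /path/to/magento
rm -rf var/cache/*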

Magento Indexes Issue - Can't reindex

I have a problem with index management inside my Magento 1.6.2.0 store. Basically, I can't get the indexes to update. The status says Processing, and it has said that for over three weeks now.
And when I try to reindex I get the message Stock Status Index process is working now. Please try run this process later, but "later" has been three weeks now. So it looks like the process is frozen, but I don't know how to restart it.
Any ideas?
cheers
Whenever you start an indexing process, Magento writes out a lock file to the var/locks folder.
$ cd /path/to/magento
$ ls var/locks
index_process_1.lock index_process_4.lock index_process_7.lock
index_process_2.lock index_process_5.lock index_process_8.lock
index_process_3.lock index_process_6.lock index_process_9.lock
The lock file prevents another user from starting an indexing process. However, if the indexing request times out or fails before it can complete, the lock file will be left behind in a locked state. That's probably what happened to you. I'd recommend you check the last-modified dates on the lock files to make sure someone else isn't running the re-indexer right now, and then remove the lock files. This will clear up your
Stock Status Index process is working now. Please try run this process later
error. After that, run the indexers one at a time to make sure each one completes.
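A sketch of that check-then-remove step, from the Magento root (the paths follow the listing above):
$ cd /path/to/magento
$ ls -l var/locks                       # check the last-modified dates first
$ rm var/locks/index_process_*.lock     # only safe if no indexer is actually running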
Did you call the script manually? If not, create a file in your root folder and put this code in it:
require_once 'app/Mage.php';
umask(0);
Mage::app('default');
$process = Mage::getSingleton('index/indexer')->getProcessByCode('catalog_product_flat');
$process->reindexAll();
This code runs the indexing of your Magento store manually. Sometimes, if your store contains a large number of products, reindexing requires a lot of time, so when you go to Index Management in the admin it shows some indexes stuck in the Processing stage; this code may help you move them from Processing back to Ready.
You can also run the indexers over SSH if you have the rights; it's faster, too.
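Magento 1.x ships a CLI indexer for exactly this; a sketch, run from the Magento root over SSH:
php shell/indexer.php info           # list the indexer codes
php shell/indexer.php --reindexall   # or: --reindex catalog_product_flat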
For newer versions of Magento, i.e. 2.1.3, I had to use this solution:
http://www.elevateweb.co.uk/magento-ecommerce/magento-error-sqlstatehy000-general-error-1205-lock-wait-timeout-exceeded
This might happen if you are running a lot of custom scripts and killing them before the database connection gets a chance to close.
If you log in to MySQL from the CLI and run the command
SHOW PROCESSLIST;
you will get the following output
+---------+---------+-------------------+---------+---------+------+-------+------+-----------+---------------+-----------+
| Id      | User    | Host              | db      | Command | Time | State | Info | Rows_sent | Rows_examined | Rows_read |
+---------+---------+-------------------+---------+---------+------+-------+------+-----------+---------------+-----------+
| 6794372 | db_user | 111.11.0.65:21532 | db_name | Sleep   | 3800 |       | NULL |         0 |             0 |         0 |
| 6794475 | db_user | 111.11.0.65:27488 | db_name | Sleep   | 3757 |       | NULL |         0 |             0 |         0 |
| 6794550 | db_user | 111.11.0.65:32670 | db_name | Sleep   | 3731 |       | NULL |         0 |             0 |         0 |
| 6794797 | db_user | 111.11.0.65:47424 | db_name | Sleep   | 3639 |       | NULL |         0 |             0 |         0 |
| 6794909 | db_user | 111.11.0.65:56029 | db_name | Sleep   | 3591 |       | NULL |         0 |             0 |         0 |
| 6794981 | db_user | 111.11.0.65:59201 | db_name | Sleep   | 3567 |       | NULL |         0 |             0 |         0 |
| 6795096 | db_user | 111.11.0.65:2390  | db_name | Sleep   | 3529 |       | NULL |         0 |             0 |         0 |
| 6795270 | db_user | 111.11.0.65:10125 | db_name | Sleep   | 3473 |       | NULL |         0 |             0 |         0 |
| 6795402 | db_user | 111.11.0.65:18407 | db_name | Sleep   | 3424 |       | NULL |         0 |             0 |         0 |
| 6795701 | db_user | 111.11.0.65:35679 | db_name | Sleep   | 3330 |       | NULL |         0 |             0 |         0 |
| 6800436 | db_user | 111.11.0.65:57815 | db_name | Sleep   | 1860 |       | NULL |         0 |             0 |         0 |
| 6806227 | db_user | 111.11.0.67:20650 | db_name | Sleep   |  188 |       | NULL |         1 |             0 |         0 |
+---------+---------+-------------------+---------+---------+------+-------+------+-----------+---------------+-----------+
15 rows in set (0.00 sec)
You can see, as an example, that for Id 6794372 the command is Sleep and the time is 3800. This is preventing other operations.
These processes should be killed one by one using the command:
KILL 6794372;
Once you have killed all the sleeping connections, things should start working as normal again.
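If there are many sleeping connections, you can generate the KILL statements instead of typing them one by one; a sketch, assuming you have the PROCESS privilege (the 300-second threshold is an arbitrary choice to tune):
SELECT CONCAT('KILL ', id, ';')
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 300;
Then run the generated statements.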
Whenever you start an indexing process, Magento writes a lock file to the var/locks folder, so you need to do two steps:
Give 777 permissions to the var/locks folder.
Delete all files in the var/locks folder.
Now refresh the Index Management page in the admin panel.
