User rok uploaded a file and set the permission to 770. The file on HDFS looks like this:
-rw-rw---- 3 rok hdfs filename1
I'm using the ksc user to consume the data uploaded by the rok user. So first, I'd like to make sure that ksc has permission to access the file filename1.
How do I find out the group name of my user ksc? Does the user belong to the hdfs group in Hadoop?
BTW, if I upload a file to Hadoop, the file permission looks like:
-rw-r--r-- 3 ksc ksc filename2
The local info for the ksc user on my Linux machine is:
uid=504(ksc) gid=502(ksc) groups=502(ksc)

Use the command below:
$ hdfs groups ksc
It lists all of the groups that user ksc belongs to.

HDFS follows the traditional style of Linux file system permissions. To determine the groups of ksc, use groups ksc if you are on Linux.
-rw-rw---- 3 rok hdfs filename1 will give you read/write permissions only if you are part of the hdfs group. Judging from your output, I don't think you are.
You will need to do one of the following:
Change rok's file permissions to 664 (read permissions for all users), which is pretty insecure
Have ksc added to the hdfs group, more secure
The choice is yours...
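To make the check concrete, here is a minimal Python sketch of the owner/group/other rule applied to the listing above. It is a simplified model, not how HDFS itself is implemented: it ignores superusers and ACLs, and the function name can_read is made up for illustration.

```python
def can_read(mode: str, owner: str, group: str, user: str, user_groups: set) -> bool:
    """Return True if `user` may read a file with the given ls-style mode string.

    Simplified model: exactly one triad applies (owner, else group, else other).
    """
    perms = mode[1:]  # drop the file-type character
    if user == owner:
        triad = perms[0:3]
    elif group in user_groups:
        triad = perms[3:6]
    else:
        triad = perms[6:9]
    return triad[0] == "r"


# ksc is only in group ksc, so the "other" triad (---) applies.
print(can_read("-rw-rw----", "rok", "hdfs", "ksc", {"ksc"}))          # False
# If ksc were also in the hdfs group, the group triad (rw-) would apply.
print(can_read("-rw-rw----", "rok", "hdfs", "ksc", {"ksc", "hdfs"}))  # True
```

This matches the two options: either the "other" triad gains read access (664), or ksc joins the group that the file belongs to.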

The way that Hadoop maps users to groups is configurable (through the hadoop.security.group.mapping property), so HDFS groups may not be the same as the Unix groups. If your Hadoop configuration does use the Unix user-group mappings, it uses the mappings on the NameNode host, not those on the client machine. The NameNode also caches the mappings for a period of time, so any changes you make may not take effect until the cache expires or is refreshed.
As for checking, in addition to what is already mentioned, you can check the actual system file that contains the mappings like this, if you have root access:
grep <user or group> /etc/group
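As a rough illustration of what that file contains, the Python sketch below parses lines in the usual /etc/group layout (name:password:gid:member,member). The sample lines and the function name groups_listing_user are invented for the example. Note that a user's primary group is recorded in /etc/passwd, so it does not appear in the member list.

```python
def groups_listing_user(user: str, group_lines: list) -> list:
    """Return names of groups whose member list in /etc/group format includes `user`."""
    found = []
    for line in group_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _password, _gid, members = line.split(":", 3)
        if user in [m for m in members.split(",") if m]:
            found.append(name)
    return found


# Hypothetical /etc/group content, made up for illustration.
sample = [
    "hdfs:x:501:hdfs,rok",
    "ksc:x:502:",
]
print(groups_listing_user("rok", sample))  # ['hdfs']
print(groups_listing_user("ksc", sample))  # []
```

On a real machine you could pass the lines of /etc/group itself, read on the NameNode host when the Unix mapping is in use.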


Load a folder from LocalSystem to HDFS

I have a folder on my local system. It contains 1000 files, and I would like to move or copy it from my local system to HDFS.
I tried these two commands:
hadoop fs -copyFromLocal C:/Users/user/Downloads/ProjectSpark/ling-spam /tmp
And I also tried this command:
hdfs dfs -put /C:/Users/user/Downloads/ProjectSpark/ling-spam
It displays an error message saying that my directory was not found, and yet I'm sure that the path is correct.
I found a function getmerge() to move a folder from HDFS to the local system, but I did not find the inverse.
Please, can you help me?
I run VirtualBox on Windows, and I work with HDP 2.3.2 through the Secure Shell console.
You can't copy files directly from your Windows machine to HDFS. You first have to SCP the files into the VM (I recommend WinSCP or FileZilla), and only then can you use hadoop fs to put files onto HDFS.
The error was correct in that C:/Users/user/Downloads does not exist on the HDP sandbox, because it's a Linux machine.
As noted, you can also try the Ambari HDFS file viewer, but I still stand by my note that SCP is the official way, because not all Hadoop systems have Ambari (or at least the HDFS file view for Ambari).
I would like to use mutual information to classify words as spam or ham. I have this formula: MI(word) = ∑ P(occ, class) * log2( P(occ, class) / (P(occ) * P(class)) ).
As I understand the formula, I must compute 4 terms, for (true, ham), (false, ham), (true, spam) and (false, spam).
I do not understand what exactly I should write. So far, I have computed the number of files in which each word occurs.
But I do not know what exactly I must write in my function.
Thank you very much!
This is the skeleton of my function:
import org.apache.spark.rdd.RDD

def computeMutualInformationFactor(
    probaWC: RDD[(String, Double)], // probability of occurrence of the word in a given class
    probaW: RDD[(String, Double)],  // probability of occurrence of the word in any class
    probaC: Double,                 // probability that an email appears in the class (spam or ham)
    probaDefault: Double            // default value when a probability is missing
): RDD[(String, Double)] = {
  ??? // body omitted in the question
}
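As a plain illustration of the formula (not the Spark implementation asked for), here is a minimal Python sketch that sums the four (occurs, class) terms from document counts. The function name and its count-based parameters are assumptions made for this example, and zero-probability terms are skipped here rather than replaced by a default value such as probaDefault.

```python
import math


def mutual_information(n_word_spam: int, n_spam: int, n_word_ham: int, n_ham: int) -> float:
    """Mutual information (in bits) between 'word occurs in email' and the email's class.

    n_word_spam / n_word_ham: emails of each class that contain the word.
    n_spam / n_ham: total emails of each class.
    """
    total = n_spam + n_ham
    # Joint counts for the four (occurs, class) combinations.
    joint = {
        (True, "spam"): n_word_spam,
        (False, "spam"): n_spam - n_word_spam,
        (True, "ham"): n_word_ham,
        (False, "ham"): n_ham - n_word_ham,
    }
    p_class = {"spam": n_spam / total, "ham": n_ham / total}
    p_true = (n_word_spam + n_word_ham) / total
    p_occ = {True: p_true, False: 1 - p_true}
    mi = 0.0
    for (occ, cls), count in joint.items():
        p_joint = count / total
        if p_joint == 0:
            continue  # a zero-probability term contributes nothing to the sum
        mi += p_joint * math.log2(p_joint / (p_occ[occ] * p_class[cls]))
    return mi


print(mutual_information(50, 50, 0, 50))   # word in every spam email and no ham email
print(mutual_information(10, 50, 10, 50))  # word equally common in both classes
```

With balanced classes, a word that appears in every spam email and in no ham email carries one full bit of information about the class, while a word that is equally frequent in both classes carries essentially none.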

How do I find home directories that are writable by group or other?

I am really new to Bash scripting, so please bear with me if this question sounds stupid. I am also not too sure what to search for on the internet. What should I do if I need to write a shell script to list any directory where one user's home directory can be modified by some other user? I am not able to understand what 'modified by some other user' means. Please help. Thanks!
The very short answer to your question is: no script needed, simply:
ls -al /home
That will list all users and the respective permissions for each user's home directory. Linux file permissions are displayed as a 10-character string that shows who has access and what, if any, special permissions are associated with a given file. The string is usually written for discussion as drwxrwxrwx. The first character gives the file type, and the special bits show up in place of x inside the permission sets:
-: (unset) indicates a regular file with no special properties
d: directory,
l: link,
s: setuid/setgid, shown in place of x in the owner or group set (not in the first position)
t: sticky bit, shown in place of x in the world set
The next nine bits rwxrwxrwx (3 sets of rwx) control the access that the owner, the group, and the world have to the file in question. So who is the owner, group, or world? Let's look at an example from ls -al /home:
drwxr-xr-x 15 deborah users 4096 Mar 11 2011 deborah
Looking at the information we can separate the 10 bits and information as follows:
d rwx r-x r-x  ..  deborah  users  .....  deborah
   |   |   |          \       \              \
 owner |  world      owner   group        filename
     group
Above, the first character is a d, which indicates that the filename (at the far right, deborah) is a directory. The first set of 3 bits specifies that the owner (deborah) has read, write and execute permission on the file. Similarly, the next set of 3 specifies that the group (users) has read and execute permission but no write permission. NOTE: with a directory, the execute bit also controls whether the owner, group or world can descend into the directory. In like manner, the world (everybody) has the same permission as the group (users).
To manipulate the bits, you use the chmod (change mode) command. To manipulate the user or group, you use the chown (change owner) command. The chown command has simple basic usage, just specify the new owner and group separated by a colon :. For example to change the file shown above to be owned by user david and group samba the command would be chown david:samba filename
There are two ways to change the permissions (or mode) with chmod. The first is to specify the octal equivalent for the special bits and the 3 sets of owner, group and world bits at once, numerically. Example: to make the directory rwx for the user and group, and r-x for the world, you would issue the command:
chmod 0775 filename # to set all permissions as desired at once
The 0 simply states that there are no special bit settings for the directory, the first 7 indicates binary 111 (or rwx) for the user, the second 7 indicates the same for the group, and the final 5 indicates that the world should have (binary 101) r-x permissions. While not always required, it is recommended to provide the leading 0 even when there will be no change to the special permission bits, to remove any ambiguity.
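If you want to double-check an octal mode before applying it, a short Python sketch using the standard stat module shows the same mapping. The value 0o775 and the directory flag are taken from the example above; nothing here touches the file system.

```python
import stat

# 0o775 is the permission part; S_IFDIR marks the entry as a directory.
mode = stat.S_IFDIR | 0o775
print(stat.filemode(mode))  # drwxrwxr-x

# Each octal digit is one set of three bits: 7 = 111, 7 = 111, 5 = 101.
for who, digit in zip(("owner", "group", "world"), "775"):
    bits = format(int(digit, 8), "03b")
    print(who, digit, bits)
```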
You can also use chmod symbolically, with +, - or = and r, w, x (for the corresponding rwx bits), applied to u, g, or o for the user (owner), group, or other (world) permissions (you can shortcut all three with a). To set the mode to the same value as shown above in octal, starting from the drwxr-xr-x directory in the example, you would simply do:
chmod g+w filename # to add the single write bit to group 'users'
Using this method, you may need several calls to chmod (or several comma-separated clauses, such as u=rwx,g=rwx,o=rx) to set all permissions as required. In contrast, with the octal form you always set all permission fields in a single call.
Obviously there is much more to it than this, but for a good introduction, this should be enough to get you started managing permissions and ownership. (obviously this post also turned out way longer than initially anticipated, enjoy).
"where one user's home directory can be modified by some other user?"
This can be the case:
if user1 is in the same group as user2 AND the home directory is group-writable, or
if the user has a world-writable home directory
You really need to understand how Unix-like permissions work (or, in a wider context, how ACLs work in general).
For a (partial) solution (there are many ways; this is one of them):
you can get the paths of the home directories from the /etc/passwd file,
you can read them in a loop (filter /etc/passwd with the cut command), and
test whether they're writable for you (for this, read the man page about the shell builtin if and the command test, alias [).
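The steps above describe a shell approach with cut and test. As a sketch of the same idea in Python instead, the code below reads the account database through the standard pwd module and checks the group-write and other-write bits directly, which answers "writable by group or other" rather than "writable by me". The function names are made up for this example, and the pwd module is only available on Unix-like systems.

```python
import os
import pwd
import stat


def writable_by_group_or_other(mode: int) -> bool:
    """True if the group-write or other-write bit is set in `mode`."""
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def risky_home_directories():
    """Yield (user, home, mode string) for home directories others could modify."""
    for entry in pwd.getpwall():  # one record per account, as in /etc/passwd
        try:
            st = os.stat(entry.pw_dir)
        except OSError:
            continue  # home directory missing or not accessible
        if stat.S_ISDIR(st.st_mode) and writable_by_group_or_other(st.st_mode):
            yield entry.pw_name, entry.pw_dir, stat.filemode(st.st_mode)


if __name__ == "__main__":
    for user, home, mode in risky_home_directories():
        print(user, home, mode)
```

System accounts often share directories such as / or /tmp as their home, so you may want to filter the result by UID range or by path prefix.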

Loading new files using Pig LOAD statement

I want to load data from HDFS into an HBase table using a Pig script.
I have the HDFS folder structure below:
-rw-r--r-- 1 user supergroup 63 2014-05-15 20:28 dataparse/good/goodrec_051520142028
-rw-r--r-- 1 user supergroup 72 2014-05-15 20:30 dataparse/good/goodrec_051520142030
-rw-r--r-- 1 user supergroup 110 2014-05-15 20:32 dataparse/good/goodrec_051520142032
In the above, all filenames have a timestamp appended.
Below is my Pig script to load from HDFS to HBase:
G = LOAD '/user/user/dataparse/good/' USING PigStorage(',') as (c1:chararray, c2:chararray,c3:chararray,c4:chararray,c5:chararray);
STORE G INTO 'hbase://test' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('t1:name t1:state t1:phone_no t1:gender');
The script is working fine, and the data from all 3 files is written to the HBase "test" table.
Suppose that after some time more files with the same structure arrive in HDFS. When I run the Pig script again, it will LOAD all the files in the "good" directory, including the files that were already read. So how can I load only the new files? Files that were already loaded should not be loaded again into my HBase table.
How can I do this?
I think you have a few options here.
Using globs
Use a shell script to pick up the "new" files, and use the glob feature so that multiple files can be fed into the script. If the files have a date and timestamp in the filename, then you can use globs directly.
Using big guns
If globs are failing you, then you need to bring out the big guns: write a custom load function, put the logic to identify "new" files in it, and you should be good to go.
You need to have some scheduling mechanism so that the Pig job runs from time to time. In this process, you can process only the files that were not processed earlier by keeping track of the timestamps and file names, or any other field.
For more information, see: Execute Pig from within Java Application
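One way to keep track of what has already been loaded is sketched below in Python. It assumes that the filename suffix is a timestamp in MMDDYYYYHHMM form, which is what the listing in the question suggests, and that the time of the last successful load is stored somewhere between runs. The function names are invented for this example.

```python
from datetime import datetime


def file_timestamp(path: str) -> datetime:
    """Parse the MMDDYYYYHHMM suffix of names like goodrec_051520142028."""
    suffix = path.rsplit("_", 1)[1]
    return datetime.strptime(suffix, "%m%d%Y%H%M")


def new_files(paths, last_loaded: datetime):
    """Return the paths whose timestamp is strictly later than `last_loaded`, oldest first."""
    return sorted(
        (p for p in paths if file_timestamp(p) > last_loaded),
        key=file_timestamp,
    )


paths = [
    "dataparse/good/goodrec_051520142028",
    "dataparse/good/goodrec_051520142030",
    "dataparse/good/goodrec_051520142032",
]
# Suppose the previous run loaded everything up to 20:30.
pending = new_files(paths, datetime(2014, 5, 15, 20, 30))
print(",".join(pending))  # dataparse/good/goodrec_051520142032
```

The resulting list could then be handed to the Pig script by the scheduler, for instance through a parameter. How last_loaded is persisted between runs is left open here.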

Is there an apache pig equivalent of "SHOW TABLES"?

I have a Hadoop data store I'm accessing in Pig and not a lot of documentation on it, plus I'm new to Pig, so I am looking for the Pig equivalent of "SHOW TABLES". When I have a connection to a MySQL db I can do this and get a general sense of what data is in there; I have found several tutorials but nothing on point. If not, is there some other way to orient myself to a Hadoop data store I know nothing about?
ETA: This would be when running Pig in interactive mode, rather than loading a script. Probably obvious, but I thought I should mention it.
The closest thing I can see to 'show tables' is the 'history' command, which effectively lists all aliases created.
grunt> history
1 a = LOAD 'iris.csv' USING PigStorage (',') AS
2 b = FILTER a BY spec==1;
3 c = GROUP b BY pw;
Pig doesn't have a concept of tables. It can read any file that is on your HDFS filesystem and stores the parsed result in a relation.
Note that you can also run HDFS filesystem commands from the grunt shell.
It's probably best to familiarise yourself with HDFS first and make sure you can comfortably navigate the filesystem, so you can find the data you want to process with Pig.
We also came across a similar situation and tried the solutions on Stack Overflow, but none of them solved our issue. The solution to this problem is to use Pig's store command and provide a dedicated folder for it.
The setup that we prefer is:
grunt> fs -mkdir /user/hduser/AllPigTableStructures/
grunt> fs -chmod 777 /user/hduser/AllPigTableStructures/
Now we will store all table information in this folder, named "AllPigTableStructures".
Then use the store command, as in the code below:
grunt> store extract_details into '/user/hduser/AllPigTableStructures/SchemaTwit' using PigStorage('\t', '-schema');
The last line of the output should be:
2017-09-18 02:13:56,566 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Now you should see a folder named SchemaTwit, like this:
grunt> fs -ls /user/hduser/AllPigTableStructures
Found 12 items
drwxr-xr-x - hduser supergroup 0 2017-09-18 02:13 /user/hduser/AllPigTableStructures/SchemaTwit
Finally, if you list the contents of the SchemaTwit directory, you will find the schema of your table and all the details about it; the part-m-xxxxx files contain the data. The command is:
grunt> fs -ls /user/hduser/AllPigTableStructures/SchemaTwit
Found 4 items
-rw-r--r-- 2 hduser supergroup 8 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/.pig_header
-rw-r--r-- 2 hduser supergroup 239 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/.pig_schema
-rw-r--r-- 2 hduser supergroup 0 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/_SUCCESS
-rw-r--r-- 2 hduser supergroup 140 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/part-m-00000
Now you can use the cat command below on the schema file to see the schema of your table, or on a part-m-xxxxx file to browse the data:
grunt> fs -cat /user/hduser/AllPigTableStructures/SchemaTwit/.pig_schema
{"fields":[{"name":"id","type":50,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"text","type":50,"description":"autogenerated from Pig Field Schema","schema":null}],"version":0,"sortKeys":[],"sortKeyOrders":[]}
To load your table with its schema, this command helps:
WithSchema = LOAD '/user/hduser/AllPigTableStructures/SchemaTwit';
PS: We are running Pig in MapReduce mode.
It looks like you have misunderstood Pig. As #seedhead has specified, you handle files with Pig. Folks quite often mistake it for a database (like HBase) or a warehouse (like Hive), which it is not. As far as visualizing the data is concerned, you can list the files and directories through the Pig shell. And if you need to see how many records (or lines) a particular file has, you could do something like this:
Records = LOAD '/path_of_the_file';
Records_Group= GROUP Records ALL;
Records_Count = FOREACH Records_Group GENERATE COUNT(Records);

Uploading media in WordPress

I am trying to upload images or any other media type to my WordPress application, but I get this error:
Unable to create directory /home/admin/video/wp-content/uploads/2012/07. Is its parent directory writable by the server?
even though I am sure that the parent directory is writable. It actually has 777 permissions. What might be the problem?
Thank you.
The question is... writable by who or what? You probably need to make the entire "uploads" directory writable by PHP (a.k.a. the web server). Often, apache and other servers default to the user-group www-data, but it could be different. Check your apache or lighttpd (or whatever) configuration files to see what user and user-group it runs as. Often these are in /etc/apache or /etc/lighttpd et cetera. Then, make the uploads directory recursively writable to that group.
Using 777 permissions is a very bad idea. You always want to give as few users as possible access to any given directory. So, here's a short discourse on file permissions....
drwxrwxrwx 20 connermcd staff 680 Jul 25 20:38 img
-rw-r--r-- 1 admin www-data 18530 Jul 26 21:46 example
The first character of the permissions string denotes the type. In this case, img is a directory and example is a file. This could also be an l for a symbolic link (among other things). The remaining characters of the string (rwxrwxrwx) define permissions. As you can see, it's a repeating triplet of "read, write, execute". The first triad represents permission for the file or directory's owner. The owner is shown in the third column (connermcd for img and admin for example). The second triad denotes permission for the file or directory's group (staff for img and www-data for example). The last triad denotes permissions for anyone (even someone you gave temporary access to your server or a hacker, hint hint).
Each of the "read, write, execute" triads can be represented by a number. It's easy for me to think about rwxrwxrwx as 421421421. Those are the only powers of two that add up to 7, if that helps you. So, the 4 stands for read, the 2 stands for write, and the 1 stands for execute. If you add these together within each triad, you can denote each triad with a single digit, and the whole permission string with three digits. So what chmod 777 img is really doing is giving "read, write, and execute" permission to everyone. It also sets those permissions only for that directory and not for the directories underneath it. To do this recursively you can use the -R flag: chmod -R.
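The addition can be written out as a small Python sketch. The helper names are made up for this example, and the sketch only handles plain r, w, x and - characters (no setuid, setgid or sticky markers).

```python
def triad_to_digit(triad: str) -> int:
    """Convert one triad such as 'r-x' to its octal digit (r=4, w=2, x=1)."""
    values = {"r": 4, "w": 2, "x": 1}
    return sum(values[ch] for ch in triad if ch != "-")


def permissions_to_octal(perms: str) -> str:
    """Convert nine permission characters such as 'rwxr-xr-x' to '755'."""
    triads = (perms[0:3], perms[3:6], perms[6:9])
    return "".join(str(triad_to_digit(t)) for t in triads)


print(permissions_to_octal("rwxrwxrwx"))  # 777 (the img directory above)
print(permissions_to_octal("rw-r--r--"))  # 644 (the example file above)
```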
In your case, you just want to make the uploads folder and all its subdirectories available to the user group your server runs as. In most cases that's www-data, so I'll use that as an example. You probably want to set your project files as owned by your user to make them easier to move, edit, etc. So let's assume you are the owner of the files (use chown to set) and that they belong to the www-data group (use chgrp to set). In that case we want to give the owner full permissions and the group read and write permissions, plus execute on directories so that the server can enter them, and we want to do it recursively. So go to the parent directory of the uploads folder and do chmod -R 770 uploads (760 would leave the group unable to traverse the directories).
You may also check that your "Settings->Media" page is correct, and look at the "Uploading Files" section.
The folder (and all subfolders) indicated in "Store uploads in this folder" must have 755 permissions.
