Word Count Hadoop Example


I am running the word count example that ships with Hadoop (version 0.20.3-dev) on a 41 GB file, using the default configuration settings. The code gives correct output for a small file, but it produces garbage for the 41 GB file. Why is this happening?

Thanks, everybody. It may produce wrong output because Hadoop, by default, does not know your file format; it treats every file as a plain text file.

Related

Importing a file into Greenplum fails because of one bad line of data (Navicat)

When I import a file into Greenplum, one line fails and the whole file is not imported. Is there a way to skip the bad line and import the rest of the data successfully?
Here are the SQL statement I ran and the error messages:
copy cjh_test from '/gp_wkspace/outputs/base_tables/error_data_test.csv' using delimiters ',';
ERROR: invalid input syntax for integer: "FE00F760B39BD3756BCFF30000000600"
CONTEXT: COPY cjh_test, line 81, column local_city: "FE00F760B39BD3756BCFF30000000600"

Greenplum has an extension to the COPY command that lets you log errors and set a limit on how many bad rows may occur without stopping the load. Here is an example from the documentation for the COPY command:
COPY sales FROM '/home/usr1/sql/sales_data' LOG ERRORS
SEGMENT REJECT LIMIT 10 ROWS;
That tells COPY to tolerate bad rows, up to a reject limit of 10, without stopping the load. The reject limit can be a number of rows or a percentage of the load file. You can check the full syntax in psql with: \h copy
If you are loading a very large file into Greenplum, I would suggest looking at gpload or gpfdist (which also support the segment reject limit syntax). COPY is single-threaded and goes through the master server, whereas gpload/gpfdist load the data in parallel to all segments. COPY will be faster for smaller load files, and the others will be faster when the load files contain millions of rows.
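If you would rather locate the offending rows before loading, a quick pre-check outside the database can help. The following is a minimal Python sketch, not part of Greenplum: the function name find_bad_integer_rows and the column index are hypothetical, it assumes a plain comma-delimited file without quoted fields (matching the delimiter in the question), and Python's int() is only an approximation of how the database parses integers.

```python
def find_bad_integer_rows(lines, column_index, delimiter=","):
    """Return (line_number, value) pairs whose column cannot be parsed as an integer.

    Rows that are too short to contain the column are reported with value None.
    """
    bad = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        if column_index >= len(fields):
            bad.append((line_number, None))
            continue
        value = fields[column_index]
        try:
            int(value)
        except ValueError:
            bad.append((line_number, value))
    return bad


# Hypothetical sample rows; the second one carries the value from the error message.
sample = ["1,100", "2,FE00F760B39BD3756BCFF30000000600", "3,300"]
print(find_bad_integer_rows(sample, column_index=1))
```

For a real file you could pass an open file object instead of the sample list, since the function only iterates over lines.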

Hadoop: Using Pig to add text at the end of every line of an HDFS file

We have files in HDFS containing raw logs. The logs are line-separated, so each log entry is one line.
Our requirement is to add some text (for example ' 12345') to the end of every log entry in these files, using Pig, a hadoop command, or any other MapReduce-based tool.
Please advise.
Thanks
AJ

Load the files so that each log entry goes into a single field, i.e. line:chararray, and use CONCAT to add the text to each line. Then store the result in a new location. If you want individual output files, you will have to parameterize the script to load each file and store it into a new file instead of using a wildcard load.
Log = LOAD '/path/wildcard/*.log' USING TextLoader() AS (line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line,'Your Text') AS newline;
STORE Log_Text INTO '/path/NewLog.log';

If your files aren't extremely large, you can do that with a single shell command.
hdfs dfs -cat /user/hdfs/logfile.log | sed 's/$/ 12345/' |\
hdfs dfs -put - /user/hdfs/newlogfile.txt
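
The same per-line transformation can be written as a small Python function. This is only an illustrative sketch: the name append_suffix is hypothetical, and wiring it to standard input and output so that it could act as a mapper in a map-only Hadoop Streaming job is an assumption, not something the answers above describe.

```python
def append_suffix(lines, suffix):
    """Yield each input line with `suffix` added before its line ending."""
    for line in lines:
        # Drop the existing line ending, append the text, then restore a newline.
        yield line.rstrip("\r\n") + suffix + "\n"


# Hypothetical sample log lines; ' 12345' is the text from the question.
sample = ["first log entry\n", "second log entry\n"]
print("".join(append_suffix(sample, " 12345")), end="")
```

Because the function is a generator, it handles one line at a time and never holds the whole file in memory.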

Hive creates an empty table even though there are plenty of files

I put some files into HDFS (/path/to/directory/) that contain data like the following:
63 EB44863EA74AA0C5D3ECF3D678A7DF59
62 FABBC9ED9719A5030B2F6A4591EDB180
59 6BF6D40AF15DE2D7E295EAFB9574BBF8
All of them have names like _user_hive_warehouse_file_name_000XYZ_A. These files were downloaded from another HDFS.
I'm trying to create an external table in Hive:
CREATE EXTERNAL TABLE users(
id int,
user string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/path/to/directory/';
It says:
OK
Time taken: 0.098 seconds
select * from users; returns empty.
select count(1) from users; returns 0.
Hive creates the table successfully, but it is always empty. If I add another file, such as another.txt, containing the sample data shown above, select count(1) from users; returns 3.
What am I missing? Why is the table empty?
Environment:
JDK 7
Hadoop 2.6.0
Hive 0.14.0
Ubuntu 14.04

I think you are encountering an issue that is peripherally discussed in HIVE-6431. In particular, this comment is the important one:
By default, FileInputFormat(which is the super class of various formats) in hadoop ignores file name starts with "_" or ".", and hard to walk around this in hive codebase.
The workaround is probably to avoid using filenames that begin with _ or .

When you run a query in Hive, it typically runs internally as a MapReduce job over the HDFS path where the files are stored. The job uses FileInputFormat to read the HDFS files, and FileInputFormat has a hiddenFileFilter that ignores any file whose name starts with an underscore ("_") or a dot ("."). You can exclude additional files by passing a custom PathFilter to FileInputFormat.setInputPathFilter. Hadoop treats files with a leading underscore as "special" files that hold job output status and logs, which is probably why they are ignored.
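
To make the naming rule concrete, here is a minimal Python sketch of the check described above. It is not Hadoop's actual code; the function names is_hidden and visible_files are hypothetical, and the file list is made up from the names in the question.

```python
def is_hidden(file_name):
    """Apply the rule described above: names starting with '_' or '.' are skipped."""
    return file_name.startswith(("_", "."))


def visible_files(file_names):
    """Return only the names that such a filter would let through."""
    return [name for name in file_names if not is_hidden(name)]


names = [
    "_user_hive_warehouse_file_name_000XYZ_A",  # name pattern from the question
    "another.txt",
    ".hidden",
    "part-00000",
]
print(visible_files(names))  # prints ['another.txt', 'part-00000']
```

Under this rule the downloaded files are filtered out because of their leading underscore, while another.txt is read, which matches the counts reported in the question.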

Generating "terasort" input data set with TeraGen

I want to generate a data set (for my own "terasort" MapReduce job) by running the TeraGen program that ships with Hadoop (inside hadoop-examples.jar):
hadoop jar /<full-path>/lib/hue/apps/oozie/examples/lib/hadoop-examples.jar teragen 1000 ./teragen
I am not getting the expected output that should follow the format:
(10 bytes key) (10 bytes rowid) (78 bytes filler) \r \n
Instead, I am getting a file that:
starts with JimGrayRIP followed by a NUL character (when I try to paste it, it gets truncated; I uploaded a copy to Dropbox),
contains two characters repeated every 100 bytes, but instead of 0D 0A they are EE FF.
What could be wrong?
Could this be an encoding issue?
Is a sample "terasort" data set available for download anywhere?
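
One way to make the observation reproducible is to slice the output into fixed-length records and look at the last two bytes of each. The following Python sketch only inspects the bytes and does not explain why they differ; the function name record_terminators is hypothetical, and it assumes the TeraGen output has been copied to a local file and read in binary mode, with the 100-byte record length taken from the format quoted above.

```python
def record_terminators(data, record_length=100):
    """Return the last two bytes of each complete fixed-length record as hex strings."""
    terminators = []
    # Step through the data one record at a time, ignoring a trailing partial record.
    for start in range(0, len(data) - record_length + 1, record_length):
        record = data[start:start + record_length]
        terminators.append(record[-2:].hex().upper())
    return terminators


# Hypothetical record in the expected layout: 10-byte key, 10-byte rowid,
# 78-byte filler, then CR LF.
sample = b"k" * 10 + b"r" * 10 + b"f" * 78 + b"\r\n"
print(record_terminators(sample * 2))  # prints ['0D0A', '0D0A']
```

On a file like the one described in the question, the same function would be expected to report EEFF for each record instead of 0D0A.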

Got error 22 from storage engine (MySQL)

mysqldump: Error: 'got error 22 from storage engine' when trying to dump tablespaces
mysqldump: Got error: 23: Out of resources when opening file '.\database\table.MYD' (Errcode: 24) when using LOCK TABLES
I get this error when trying to dump any database I select. It looks like the database is corrupted. Is it possible to repair it?

You seem to have reached the maximum number of open files. This limit is imposed either by MySQL or by the operating system. You can:
Increase the value of open_files_limit in your MySQL configuration file (this directive does not exist in a default installation, so you might need to add it to the [mysqld] section).
Increase the limit at the system level (but I am not sure this applies to Windows).

Here are some causes of this error when loading a SQL file:
Type “source path-to-SQL-file”, but follow these rules:
Use the full source command, not the . shortcut.
Have no spaces in your path. I copied my file to the root of a drive. Note that spaces in the file name are OK, just not in the path.
Do not quote the file name, even if it has spaces. Quoting it gave me error 22.
Use forward slashes in the path, e.g., C:/path/to/filename.sql. Otherwise you’ll get error 2.
Do not end the command with a semicolon.

Please check your read/write access to the drive where your MySQL database is stored.
Error 22 usually occurs when you have no write access to that drive.
