Processing different file types in Hadoop

I have installed Hadoop and Hive. I can process and query XLS and TSV files using Hive. I want to process other file types such as DOCX, PDF, and PPT. How can I do this? Is there any separate procedure for processing these files on AWS? Please help me.

There isn't any difference in consuming those files on AWS compared with any other Hadoop platform. For easy access and durable storage, you can put those files in S3.
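Binary formats such as PDF aren't record-oriented like TSV, so one common pattern is to read each file whole and extract its text yourself. Below is a minimal, non-authoritative PySpark sketch of that idea; it assumes S3 access is configured and the pypdf library is installed on the executors, and the bucket and paths are made up:

    import io

    from pypdf import PdfReader
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pdf-text-extract").getOrCreate()

    def pdf_to_text(record):
        """Turn one (path, raw bytes) pair into (path, extracted text)."""
        path, raw = record
        reader = PdfReader(io.BytesIO(raw))
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        return (path, text)

    # binaryFiles yields one (path, bytes) pair per file, so each PDF is
    # handled whole rather than split into records.
    rdd = spark.sparkContext.binaryFiles("s3a://my-bucket/docs/*.pdf")
    df = spark.createDataFrame(rdd.map(pdf_to_text), ["path", "text"])

    # Land the extracted text somewhere Hive can query it.
    df.write.parquet("s3a://my-bucket/extracted/")

The same approach works for DOCX or PPT with a suitable extraction library swapped in.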

Related

How to save large files in HDFS and links in HBase?

I have read that it is recommended to save files larger than 10 MB in HDFS and to store the path of each file in HBase. Is there a recommended approach for doing this? Are there any specific configurations or tools, such as Apache Phoenix, that can help achieve it?
Or must all of it (saving the data in HDFS, saving the location in HBase, then reading the path from HBase and reading the data from HDFS at that location) be done manually from the client?
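For reference, the manual client-side pattern the question describes could look roughly like this in Python, using the happybase and hdfs packages. The host names, table name, and column family are illustrative assumptions, and the HBase table is assumed to already exist:

    import happybase                 # HBase client over Thrift
    from hdfs import InsecureClient  # WebHDFS client

    hdfs_client = InsecureClient("http://namenode:9870", user="hadoop")
    hbase = happybase.Connection("hbase-thrift-host")
    table = hbase.table("documents")  # assumes a table with family 'meta'

    def save(doc_id: str, payload: bytes) -> None:
        # 1) Write the (large) blob to HDFS...
        path = f"/data/docs/{doc_id}.bin"
        with hdfs_client.write(path, overwrite=True) as writer:
            writer.write(payload)
        # 2) ...then store only its path in HBase.
        table.put(doc_id.encode(), {b"meta:hdfs_path": path.encode()})

    def load(doc_id: str) -> bytes:
        # Look up the path in HBase, then fetch the bytes from HDFS.
        row = table.row(doc_id.encode())
        path = row[b"meta:hdfs_path"].decode()
        with hdfs_client.read(path) as reader:
            return reader.read()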

How to run analytics on Parquet files in a non-Hadoop environment

We are generating Parquet files using Apache NiFi in a non-Hadoop environment, and we need to run analytics on those Parquet files.
Apart from using Apache frameworks like Hive, Spark, etc., is there any open-source BI or reporting tool that can read Parquet files, or is there any other workaround for this? In our environment we have the Jasper reporting tool.
Any suggestion is appreciated. Thanks.
You can easily process Parquet files in Python:
To read/write Parquet files, you can use pyarrow or fastparquet.
To analyze the data, you can use pandas (which can even read/write Parquet itself, using one of the implementations mentioned in the previous item behind the scenes).
To get a nice interactive data exploration environment, you can use Jupyter Notebook.
All of these work in a non-Hadoop environment.
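For example, a minimal pandas session over a Parquet file (the file name and the event_time column are illustrative):

    import pandas as pd

    # Reads via pyarrow (or fastparquet) behind the scenes.
    df = pd.read_parquet("events.parquet", engine="pyarrow")

    print(df.dtypes)        # schema overview
    print(df.describe())    # summary statistics

    # Example aggregation, assuming an 'event_time' timestamp column.
    daily_counts = df.groupby(df["event_time"].dt.date).size()
    print(daily_counts.head())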

Is it possible to join a lot of files in Apache Flume?

Our server receives a lot of files all the time, and the files are pretty small, around 10 MB each. Our management wants to build a Hadoop cluster for analyzing and storing these files, but storing small files in Hadoop is not efficient. Is there any option in Hadoop or in Flume to join these files (i.e., make one big file)?
Thanks a lot for the help.
Here's what comes to my mind:
1) Use Flume's "Spooling Directory Source". This source lets you ingest data by placing the files to be ingested into a "spooling" directory on disk.
Write your files to that directory.
2) Use whichever Flume channel you want: "memory" or "file". Both have advantages and disadvantages.
3) Use HDFS Sink to write to HDFS.
The "spooling directory source" will rename the file once ingested (or optionally delete). The data also survives crash or restart.
Here's the documentation:
https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source
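Putting the three pieces together, a minimal agent configuration might look like the sketch below (the agent name, spool directory, and HDFS path are illustrative assumptions). The size-based rolling settings are what concatenate many small input files into fewer large HDFS files:

    # Sketch: spooldir source -> file channel -> HDFS sink.
    agent1.sources = spool1
    agent1.channels = ch1
    agent1.sinks = sink1

    # 1) Spooling directory source: drop finished files here.
    agent1.sources.spool1.type = spooldir
    agent1.sources.spool1.spoolDir = /var/flume/spool
    agent1.sources.spool1.channels = ch1

    # 2) Durable file channel (survives crashes/restarts).
    agent1.channels.ch1.type = file

    # 3) HDFS sink: roll only on size, so many ~10 MB inputs
    #    are written out as ~128 MB HDFS files.
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/incoming/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
    agent1.sinks.sink1.hdfs.rollSize = 134217728
    agent1.sinks.sink1.hdfs.rollCount = 0
    agent1.sinks.sink1.hdfs.rollInterval = 0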

namenode.LeaseExpiredException in df.write.parquet when reading from a non-HDFS source

I have Spark code that runs on a YARN cluster and converts CSV to Parquet using the Databricks CSV library.
It works fine when the CSV source is HDFS, but when the CSV source is non-HDFS, which is usually the case, I come across this exception.
It should not happen, since the same code works for an HDFS CSV source.
Complete link to the issue:
https://issues.apache.org/jira/browse/SPARK-19344
As discussed in the comments: when the files are on the driver node but not accessible by the worker nodes, the read will fail.
When reading an input file (e.g. spark.read in Spark 2.0), the files must be accessible by all executor nodes (e.g. when the files are on HDFS, Cassandra, etc.).
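A minimal PySpark sketch of the difference, with illustrative paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Fails on a multi-node YARN cluster if the file exists only on the
    # driver machine -- the executors can't open it:
    # df = spark.read.csv("file:///home/me/input.csv", header=True)

    # Works: shared storage that every executor can reach.
    df = spark.read.csv("hdfs:///data/input.csv", header=True)
    df.write.parquet("hdfs:///data/output.parquet")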

Getting data in and out of Hadoop

I need a system to analyze large log files. A friend pointed me to Hadoop the other day, and it seems perfect for my needs. My question is about getting data into Hadoop:
Is it possible to have the nodes in my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and submit that file after it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs against that same file at the same time?
The Fluentd log collector just released its WebHDFS plugin, which lets users stream data into HDFS instantly. It's easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course, you can also import data directly from your applications. Here's a Java example of posting logs to Fluentd.
Fluentd: Data Import from Java Applications
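The linked example is Java; as a rough equivalent, here is a hedged Python sketch using the fluent-logger package (the tag, record fields, and a local Fluentd agent on the default forward port 24224 are assumptions):

    from fluent import event, sender

    # Point the logger at a local Fluentd agent.
    sender.setup("myapp", host="localhost", port=24224)

    # Emits a record tagged 'myapp.access'; the agent's webhdfs output
    # plugin can then stream it straight into HDFS.
    event.Event("access", {"path": "/index.html", "status": 200})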
A Hadoop job can run over multiple input files, so there's really no need to keep all your data in one file. You won't be able to process a file until its file handle is properly closed, however.
HDFS does not support appends (yet?).
What I do is run the map-reduce job periodically and output the results to a 'processed_logs_#{timestamp}' folder.
Another job can later take these processed logs and push them to a database, etc., so they can be queried online.
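A sketch of that periodic pattern, with a hypothetical job jar and paths:

    import subprocess
    from datetime import datetime, timezone

    # Each run gets its own timestamped output folder, since MapReduce
    # output directories must not already exist.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    output_dir = f"/jobs/processed_logs_{stamp}"

    subprocess.run(
        ["hadoop", "jar", "log-analyzer.jar",  # hypothetical job jar
         "/logs/incoming",                     # input: only closed files
         output_dir],                          # output: unique per run
        check=True,
    )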
I'd recommend using Flume to collect the log files from your servers into HDFS.
