What should a morphline for MapReduceIndexerTool look like? - hadoop

I want to search through a lot of logs (about 1 TB in size, placed on multiple machines) efficiently.
For that purpose, I want to build an infrastructure composed of Flume, Hadoop and Solr. Flume will get the logs from a couple of machines and will put them into HDFS.
Now, I want to be able to index those logs using a map reduce job in order to be able to search through them using Solr. I found that MapReduceIndexerTool does this for me, but I see that it needs a morphline.
I know that a morphline, in general, performs a set of operations on the data it takes, but what kind of operations should I perform if I want to use the MapReduceIndexerTool?
I can't find any example on a morphline adapted for this map reduce job.
Thank you respectfully.

Cloudera has a guide that covers an almost identical use case in its morphlines documentation.
In this figure, a Flume Source receives syslog events and sends them
to a Flume Morphline Sink, which converts each Flume event to a record
and pipes it into a readLine command. The readLine command extracts
the log line and pipes it into a grok command. The grok command uses
regular expression pattern matching to extract some substrings of the
line. It pipes the resulting structured record into the loadSolr
command. Finally, the loadSolr command loads the record into Solr,
typically a SolrCloud. In the process, raw data or semi-structured
data is transformed into structured data according to application
modelling requirements.
The use case in that example is exactly what production tools such as MapReduceIndexerTool, Apache Flume Morphline Solr Sink, Apache Flume MorphlineInterceptor and the Morphline Lily HBase Indexer run as part of their operation, as outlined in the following figure:

In general, in a morphline you only need to read your data, convert it to Solr documents and then call loadSolr to create the index.
For example, this is the morphline file I used with MapReduceIndexerTool to upload Avro data into Solr:
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        readAvroContainer {}
      }
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }
      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
When run, it reads the Avro container, maps Avro fields to Solr document fields, removes all other fields and uses the provided Solr connection details to create the index. It's based on this tutorial.
This is the command I'm using to index files and merge them into the running collection:
sudo -u hdfs hadoop --config /etc/hadoop/conf \
jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file /local/path/morphlines_file \
--output-dir hdfs://localhost/mrit/out \
--zk-host localhost:2181/solr \
--collection collection1 \
--go-live \
hdfs:/mrit/in/my-avro-file.avro
Solr should be configured to work with HDFS, and the collection should exist.
All this setup works for me with Solr 4.10 on CDH 5.7 Hadoop.
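As a quick sanity check once the --go-live merge finishes, you can query the collection over HTTP. This is only a sketch; it assumes Solr listens on the default port 8983 on localhost and that the collection is named collection1 as above:
curl 'http://localhost:8983/solr/collection1/select?q=*:*&rows=0&wt=json'
# numFound in the response should equal the number of records you indexed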

Related

MapReduceIndexerTool output dir error "Cannot write parent of file"

I want to use Cloudera's MapReduceIndexerTool to understand how morphlines work. I created a basic morphline that just reads lines from the input file, and I tried to run the tool using this command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphline.conf \
--output-dir hdfs:///hostname/dir/ \
--dry-run true
Hadoop is installed on the same machine where I run this command.
The error I'm getting is the following:
net.sourceforge.argparse4j.inf.ArgumentParserException: Cannot write parent of file: hdfs:/hostname/dir
at org.apache.solr.hadoop.PathArgumentType.verifyCanWriteParent(PathArgumentType.java:200)
The /dir directory has 777 permissions on it, so it is definitely allowed to write into it. I don't know what I should do to allow it to write into that output directory.
I'm new to HDFS and I don't know how I should approach this problem. Logs don't offer me any info about that.
What I tried until now (with no result):
created a hierarchy of 2 directories (/dir/dir2) and put 777 permissions on both of them
changed the output-dir schema from hdfs:///... to hdfs://... because all the examples in the --help menu are built that way, but this leads to an invalid schema error
Thank you.
It states 'cannot write parent of file'. And the parent in your case is /. Take a look into the source:
private void verifyCanWriteParent(ArgumentParser parser, Path file) throws ArgumentParserException, IOException {
  Path parent = file.getParent();
  if (parent == null || !fs.exists(parent) || !fs.getFileStatus(parent).getPermission().getUserAction().implies(FsAction.WRITE)) {
    throw new ArgumentParserException("Cannot write parent of file: " + file, parser);
  }
}
The value printed in the message is file, in your case hdfs:/hostname/dir, so file.getParent() will be /.
Additionally, you can test the permissions with the hadoop fs command; for example, you can try to create a zero-length file in the path:
hadoop fs -touchz /test-file
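To see why the check fails even though /dir itself has 777 permissions, you can also list the parent entry directly. This is just an illustration; the exact owner and permissions depend on your cluster, but the HDFS root is usually owned by the hdfs user and is not world-writable:
hadoop fs -ls -d /
# typical output: drwxr-xr-x   - hdfs supergroup          0 ...  /
# only the owner (hdfs) may write here, so verifyCanWriteParent fails for a regular user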
I solved that problem after days of working on it.
The problem is with that line --output-dir hdfs:///hostname/dir/.
First of all, there should not be 3 slashes at the beginning, as I had put in my repeated attempts to make this work; there should only be 2 (as in any valid HDFS URI). I had actually put 3 slashes because otherwise the tool throws an invalid schema exception! You can easily see in this code that the schema check is done before the verifyCanWriteParent check.
I tried to get the hostname by simply running the hostname command on the CentOS machine that I was running the tool on. This was the main issue. I analyzed the /etc/hosts file and saw that there are 2 hostnames for the same local IP. I took the second one and it worked. (I also attached the port to the hostname, so the final format is the following: --output-dir hdfs://correct_hostname:8020/path/to/file/from/hdfs.)
This error is very confusing because everywhere you look for the namenode hostname, you will see the same thing that the hostname command returns. Moreover, the errors are not structured in a way that you can diagnose the problem and take a logical path to solve it.
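One way to avoid guessing the NameNode hostname from /etc/hosts is to read the default filesystem URI straight from the client configuration; a sketch, assuming the HDFS client configuration is installed on the machine you run the tool from:
hdfs getconf -confKey fs.defaultFS
# prints something like hdfs://correct_hostname:8020 - use this as the prefix for --output-dir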
Additional information regarding this tool and debugging it
If you want to see the actual code that runs behind it, check the Cloudera version that you are running and select the same branch on the official repository. The master is not up to date.
If you just want to run this tool to play with the morphline (using the --dry-run option) without connecting to Solr, you can't. You have to specify a ZooKeeper endpoint and a Solr collection or a Solr config directory, which involves additional research. This is something that could be improved in this tool.
You don't need to run the tool with -u hdfs; it works with a regular user.

Concatenating multiple text files into one very large file in HDFS

I have multiple text files.
Their total size exceeds the largest disk size available to me (~1.5 TB).
A Spark program reads a single input text file from HDFS, so I need to combine those files into one. (I cannot rewrite the program code; I am given only the *.jar file for execution.)
Does HDFS have such a capability? How can I achieve this?
What I understood from your question is that you want to concatenate multiple files into one. Here is a solution which might not be the most efficient way of doing it, but it works. Suppose you have two files, file1 and file2, and you want to get a combined file as ConcatenatedFile. Here is the script for that:
hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
Hope this helps.
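Since you have many input files rather than just two, the same pattern also works with a glob; a sketch assuming all inputs sit under one directory and end in .txt (the paths below are hypothetical):
hadoop fs -cat /hadoop/path/to/file/*.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
# the data is streamed through the client machine and written straight back to HDFS,
# so no single local disk has to hold the full ~1.5 TB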
HDFS by itself does not provide such a capability. All out-of-the-box features (like hdfs dfs -text * with pipes, or FileUtil's copy methods) use your client machine to transfer all the data.
In my experience, we have always used our own MapReduce jobs to merge many small files in HDFS in a distributed way.
So you have two solutions:
Write your own simple MapReduce/Spark job to combine text files in your format.
Find an already implemented solution for this kind of purpose.
About solution #2: there is a simple project, FileCrush, for combining text or sequence files in HDFS. It might be suitable for you; check it out.
Example of usage:
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
--input-format=text \
--output-format=text \
--compress=none \
/input/dir /output/dir 20161228161647
I had problems running it without these options (especially -Ddfs.block.size and the output file date prefix 20161228161647), so make sure you run it properly.
You can do a Pig job:
A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A into '/path/to/outputFile';
Doing an hdfs cat and then putting it back into HDFS means all this data is processed on the client node and will degrade your network.

Collecting Parquet data from HDFS to local file system

Given a Parquet dataset distributed on HDFS (a metadata file + many .parquet parts), how do I correctly merge the parts and collect the data onto the local file system? dfs -getmerge ... doesn't work - it merges the metadata with the actual parquet files.
There is a way involving the Apache Spark APIs, which provides a solution, but a more efficient method without third-party tools may exist.
spark> val parquetData = sqlContext.parquetFile("pathToMultipartParquetHDFS")
spark> parquetData.repartition(1).saveAsParquetFile("pathToSinglePartParquetHDFS")
bash> ../bin/hadoop dfs -get pathToSinglePartParquetHDFS localPath
Since Spark 1.4 it's better to use DataFrame::coalesce(1) instead of DataFrame::repartition(1)
You may use Pig:
A = LOAD '/path/to parquet/files' USING parquet.pig.ParquetLoader as (x,y,z) ;
STORE A INTO 'xyz path' USING PigStorage('|');
You may create an Impala table on top of it, and then use
impala-shell -e "query" -o <output>
In the same way, you may use MapReduce as well.
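A sketch of that Impala route, assuming an external table named my_parquet_table has already been created over the Parquet directory (the table name and output path here are made up):
impala-shell -B --output_delimiter=',' \
-e "SELECT * FROM my_parquet_table" \
-o /local/path/my_table_dump.csv
# -B switches to plain delimited output; -o writes the result to a local file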
You may use parquet-tools:
java -jar parquet-tools.jar merge source/ target/

How can multiple files be specified with "-files" in the CLI of Amazon for EMR?

I am trying to start an Amazon EMR cluster via the AWS CLI, but I am a little bit confused about how I should specify multiple files. My current call is as follows:
aws emr create-cluster --steps Type=STREAMING,Name='Intra country development',ActionOnFailure=CONTINUE,Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra] \
--ami-version 3.1.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--auto-terminate \
--log-uri s3://betaestimationtest/logs
However, Hadoop now complains that it cannot find the reducer file:
Caused by: java.io.IOException: Cannot run program "reducer.py": error=2, No such file or directory
What am I doing wrong? The file does exist in the folder I specify.
For passing multiple files in a streaming step, you need to use file:// to pass the steps as a JSON file.
AWS CLI shorthand syntax uses a comma as the delimiter to separate a list of args. So when we try to pass in parameters like "-files","s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py", the shorthand syntax parser will treat the mapper.py and reducer.py files as two separate parameters.
The workaround is to use the JSON format. Please see the examples below.
aws emr create-cluster --steps file://./mysteps.json --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate --log-uri s3://betaestimationtest/logs
mysteps.json looks like:
[
  {
    "Name": "Intra country development",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files",
      "s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",
      "-mapper",
      "mapper.py",
      "-reducer",
      "reducer.py",
      "-input",
      "s3://betaestimationtest/output_0_inter",
      "-output",
      "s3://betaestimationtest/output_1_intra"
    ]
  }
]
You can also find examples here: https://github.com/aws/aws-cli/blob/develop/awscli/examples/emr/create-cluster-examples.rst. See example 13.
Hope it helps!
You are specifying -files twice; you only need to specify it once. I forget whether the CLI needs the separator to be a space or a comma for multiple values, but you can try both.
You should replace:
Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
with:
Args=[-files,s3://betaestimationtest/mapper.py s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
or if that fails, with:
Args=[-files,s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
Add an escape for the comma separating the files:
Args=[-files,s3://betaestimationtest/mapper.py\\,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]

Dump all documents of Elasticsearch

Is there any way to create a dump file that contains all the data of an index along with its settings and mappings?
Similar to the way MongoDB does it with mongodump,
or the way Solr's data folder is copied to a backup location.
Cheers!
Here's a new tool we've been working on for exactly this purpose: https://github.com/taskrabbit/elasticsearch-dump. You can export indices into/out of JSON files, or from one cluster to another.
Elasticsearch supports a snapshot function out of the box:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
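A minimal sketch of that snapshot flow, assuming a shared filesystem repository at /mnt/es_backups that is whitelisted via path.repo on every node (the repository and snapshot names here are made up):
# register a filesystem snapshot repository
curl -XPUT 'localhost:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'
# take a snapshot of all indices and wait for it to finish
curl -XPUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'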
We can use elasticdump to take the backup and restore it. We can move data from one server/cluster to another server/cluster.
1. Commands to move one index's data from one server/cluster to another using elasticdump.
# Copy an index from production to staging with analyzer and mapping:
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=analyzer
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=mapping
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=data
2. Commands to move all indices data from one server/cluster to another using multielasticdump.
Backup
multielasticdump \
--direction=dump \
--match='^.*$' \
--limit=10000 \
--input=http://production.es.com:9200 \
--output=/tmp
Restore
multielasticdump \
--direction=load \
--match='^.*$' \
--limit=10000 \
--input=/tmp \
--output=http://staging.es.com:9200
Note:
If the --direction is dump, which is the default, --input MUST be a URL for the base location of an Elasticsearch server (e.g. http://localhost:9200) and --output MUST be a directory. Each index that matches will have a data, mapping, and analyzer file created.
For loading files that you have dumped with multielasticdump, --direction should be set to load, --input MUST be a directory of a multielasticdump dump and --output MUST be an Elasticsearch server URL.
The 2nd command will take a backup of settings, mappings, template and data itself as JSON files.
The --limit should not be more than 10000; otherwise, it will give an exception.
Get more details here.
For your case, Elasticdump is the perfect answer.
First, you need to dump the mapping and then the index:
# Install the elasticdump
npm install elasticdump -g
# Dump the mapping
elasticdump --input=http://<your_es_server_ip>:9200/index --output=es_mapping.json --type=mapping
# Dump the data
elasticdump --input=http://<your_es_server_ip>:9200/index --output=es_index.json --type=data
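To load those dumps back into another cluster, the same tool works with the input and output swapped; a sketch, where <target_es_server_ip> and the index name are placeholders:
# Restore the mapping first, then the data
elasticdump --input=es_mapping.json --output=http://<target_es_server_ip>:9200/index --type=mapping
elasticdump --input=es_index.json --output=http://<target_es_server_ip>:9200/index --type=data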
If you want to dump the data on any server, I advise you to install esdump through Docker. You can get more info from this website: Blog Link
Elasticsearch itself provides a way to create a data backup and restore it. The simple command to do it is:
curl -XPUT 'localhost:9200/_snapshot/<backup_folder name>/<backupname>' -d '{
  "indices": "<index_name>",
  "ignore_unavailable": true,
  "include_global_state": false
}'
Now, how to create this folder, how to include its path in the Elasticsearch configuration so that it is available to Elasticsearch, and the restoration method are all well explained here. To see a practical demo, surf here.
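For completeness, restoring such a snapshot later is a single call to the _restore endpoint; a sketch using the same placeholder names as above:
curl -XPOST 'localhost:9200/_snapshot/<backup_folder name>/<backupname>/_restore' -H 'Content-Type: application/json' -d '{
  "indices": "<index_name>"
}'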
At the time of writing this answer (2021), the official way of backing up an Elasticsearch cluster is to snapshot it. Refer to: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html
The data itself is one or more Lucene indices, since you can have multiple shards. What you also need to back up is the cluster state, which contains all sorts of information regarding the cluster, the available indices, their mappings, the shards they are composed of, etc.
It's all within the data directory, though, so you can just copy it. Its structure is pretty intuitive. Right before copying, it's better to disable automatic flushing (in order to back up a consistent view of the index and avoid writes on it while copying files), issue a manual flush, and disable allocation as well. Remember to copy the directory from all nodes.
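A sketch of those two preparation steps using the standard APIs (the cluster address localhost:9200 is an assumption):
# issue a manual flush so the segments on disk are up to date
curl -XPOST 'localhost:9200/_flush'
# temporarily disable shard allocation while the data directories are being copied
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'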
Also, the next major version of Elasticsearch is going to provide a new snapshot/restore API that will allow you to perform incremental snapshots and restore them via the API too. Here is the related GitHub issue: https://github.com/elasticsearch/elasticsearch/issues/3826.
You can also dump Elasticsearch data in JSON format via HTTP requests using the scroll API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
curl -XPOST 'https://ES/INDEX/_search?scroll=10m'
curl -XPOST 'https://ES/_search/scroll' -d '{"scroll": "10m", "scroll_id": "ID"}'
To export all documents from Elasticsearch into JSON, you can use the esbackupexporter tool. It works with index snapshots. It takes the container with snapshots (S3, Azure blob or file directory) as the input and outputs one or several zipped JSON files per index per day. It is quite handy when exporting your historical snapshots. To export your hot index data, you may need to make the snapshot first (see the answers above).
If you want to massage the data on its way out of Elasticsearch, you might want to use Logstash. It has a handy Elasticsearch Input Plugin.
And then you can export to anything, from a CSV file to reindexing the data on another Elasticsearch cluster. Though for the latter you also have Elasticsearch's own Reindex API.
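A minimal sketch of that approach with an inline pipeline; the host and index name are placeholders, and the stdout output can be swapped for any other Logstash output plugin:
logstash -e '
input { elasticsearch { hosts => ["localhost:9200"] index => "my_index" } }
output { stdout { codec => json_lines } }
'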
