I want to pipe the output from bzip2 and use it as input to fill a TDB database using tdbloader2 from apache-jena-3.9.0.
I already found
Generating TDB Dataset from archive containing N-TRIPLES files
but the proposed solution there did not work for me.
bzip2 -dc test.ttl.bz2 | tdbloader2 --loc=/pathto/TDBdatabase_test -- -
produces
20:08:01 INFO -- TDB Bulk Loader Start
20:08:01 INFO Data Load Phase
20:08:01 INFO Got 1 data files to load
20:08:01 INFO Data file 1: /home/user/-
File does not exist: /home/user/-
20:08:01 ERROR Failed during data phase
I got similar results (inspired by https://unix.stackexchange.com/questions/16990/using-data-read-from-a-pipe-instead-than-from-a-file-in-command-options) with
bzip2 -dc test.ttl.bz2 | tdbloader2 --loc=/pathto/TDBdatabase_test /dev/stdin
20:34:45 INFO -- TDB Bulk Loader Start
20:34:45 INFO Data Load Phase
20:34:45 INFO Got 1 data files to load
20:34:45 INFO Data file 1: /proc/16256/fd/pipe:[92062]
File does not exist: /proc/16256/fd/pipe:[92062]
20:34:45 ERROR Failed during data phase
and
bzip2 -dc test.ttl.bz2 | tdbloader2 --loc=/pathto/TDBdatabase_test /dev/fd/0
20:34:52 INFO -- TDB Bulk Loader Start
20:34:52 INFO Data Load Phase
20:34:52 INFO Got 1 data files to load
20:34:52 INFO Data file 1: /proc/16312/fd/pipe:[97432]
File does not exist: /proc/16312/fd/pipe:[97432]
20:34:52 ERROR Failed during data phase
Unpacking the bz2 file manually and then loading it works fine:
bzip2 -d test.ttl.bz2
tdbloader2 --loc=/pathto/TDBdatabase_test test.ttl
It would be great if someone could point me in the right direction.
tdbloader2 accepts bz2 compressed files on the command line:
tdbloader2 --loc=/pathto/TDBdatabase_test test.ttl.bz2
It doesn't accept input from a pipe, and even if it did, it would not know that the syntax is Turtle, which it infers from the file extension.
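If decompressing on the fly is the goal, the manual approach from the question can be wrapped in a temporary file instead; a minimal sketch (the use of mktemp and the temporary file name are assumptions, not part of tdbloader2's documented interface):
# Decompress to a temporary file carrying a .ttl extension so tdbloader2 can
# detect the Turtle syntax from the name, then load and clean up.
# mktemp --suffix is GNU coreutils; any temporary path ending in .ttl works.
tmp=$(mktemp --suffix=.ttl)
bzip2 -dc test.ttl.bz2 > "$tmp"
tdbloader2 --loc=/pathto/TDBdatabase_test "$tmp"
rm -f "$tmp"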
I'm trying to relog a Performance Monitor circular log.
When I do this with a "normal" binary file, it works correctly:
C:\PerfLogs\Admin\Test>relog DataCollector01.blg -f csv -o test.csv
Input
----------------
File(s):
DataCollector01.blg (Binary)
Begin: 2016-11-22 8:18:18
End: 2016-11-22 8:21:18
Samples: 13
100.00%
Output
----------------
File: test.csv
Begin: 2016-11-22 8:18:18
End: 2016-11-22 8:21:18
Samples: 13
The command completed successfully.
But when I create a circular log, I get this error:
C:\PerfLogs\Admin\Test>relog DataCollector01.blg -f csv -o test.csv
Input
----------------
File(s):
DataCollector01.blg (Binary)
Error:
Unable to read counter information and data from input binary log files.
The Data Collector is running. When I stop it, I can relog the .blg file.
You cannot read a log while it's open. You have to stop the logging first.
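A minimal sketch of that workflow, assuming the Data Collector Set is named DataCollector01 (the name is an assumption; use the name shown in Performance Monitor):
REM Stop the collector, convert the log, then restart it.
REM The collector set name "DataCollector01" is an assumption.
logman stop DataCollector01
relog DataCollector01.blg -f csv -o test.csv
logman start DataCollector01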
I have an fsimage stored in my local directory. Following the Offline Image Viewer instructions at https://archive.cloudera.com/cdh/3/hadoop/hdfs_imageviewer.html, I executed the command below:
hadoop oiv -i fsimage -o fsimage.txt
The output is:
16/06/24 08:09:18 INFO offlineImageViewer.FSImageHandler: Loading 24 strings
16/06/24 08:09:18 INFO offlineImageViewer.FSImageHandler: Loading 3027842 inodes.
16/06/24 08:09:32 INFO offlineImageViewer.FSImageHandler: Loading inode references
16/06/24 08:09:32 INFO offlineImageViewer.FSImageHandler: Loaded 0 inode references
16/06/24 08:09:32 INFO offlineImageViewer.FSImageHandler: Loading inode directory section
16/06/24 08:09:35 INFO offlineImageViewer.FSImageHandler: Loaded 1446245 directories
16/06/24 08:09:35 INFO offlineImageViewer.WebImageViewer: WebImageViewer started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.
But the fsimage.txt file is zero bytes. I then executed the command for the XML format:
hdfs oiv -p XML -i fsimage -o fsimage.xml
This gives me fsimage.xml, but I want the output in '.txt' format. Also, the output says:
WebImageViewer started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.
Is there any UI available for accessing it? If yes, how can I access it?
You can use the Indented format, which delineates the sections of the fsimage into separate levels of indentation. The result is also saved as a .txt file.
You can refer to this: http://hadooptutorial.info/oiv-hdfs-offline-image-viewer/
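For example, a sketch only: the Indented processor ships with the older Offline Image Viewer described in the linked guides and may not be available for newer protobuf-based fsimages, so treat the processor name as an assumption to verify.
# Write an indented plain-text dump of the fsimage; check the tool's usage
# output first to confirm the Indented processor exists in your Hadoop release.
hadoop oiv -p Indented -i fsimage -o fsimage.txt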
I am trying to bulk load a csv file from the command line.
This is what I am trying:
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://localhost:9000/transactionsFile.csv bulkLoadtable
The error I am getting is below:
15/09/01 13:49:44 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:9000/transactionsFile.csv
15/09/01 13:49:44 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory hdfs://localhost:9000/transactionsFile.csv. Does it contain files in subdirectories that correspond to column family names?
Is it possible to do a bulk load from the command line without writing a Java MapReduce job?
You are almost correct; the only thing missing is that the input to bulkLoadtable must be a directory. I suggest keeping the csv file under a directory and passing the path up to the directory name as an argument to the command. Please refer to the link below.
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.html#doBulkLoad(org.apache.hadoop.fs.Path,%20org.apache.hadoop.hbase.client.Admin,%20org.apache.hadoop.hbase.client.Table,%20org.apache.hadoop.hbase.client.RegionLocator)
Hope this helps.
You can do a bulk load from the command line. There are multiple ways to do this:
1. Using HFileOutputFormat and the completebulkload tool:
a. Prepare your data by creating data files (StoreFiles) from a MapReduce job using HFileOutputFormat.
b. Import the prepared data using the completebulkload tool, e.g.:
hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
More details: hbase bulk load
2. Using importtsv, e.g.:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,id,temp:in,temp:out,vibration,pressure:in,pressure:out" sensor hdfs://sandbox.hortonworks.com:/tmp/hbase.csv
More details: see the combined sketch below, which applies this approach to the csv from the question.
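As referenced in option 2, here is a combined importtsv + LoadIncrementalHFiles sketch applied to the csv from the question. The column family "cf", the column names, and the HFile output path are assumptions; adjust them to the actual table schema.
# Step 1: parse the csv and write HFiles instead of inserting directly.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns="HBASE_ROW_KEY,cf:amount" \
  -Dimporttsv.bulk.output=hdfs://localhost:9000/tmp/hfiles \
  bulkLoadtable hdfs://localhost:9000/transactionsFile.csv
# Step 2: hand the generated HFiles over to the table's regions.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  hdfs://localhost:9000/tmp/hfiles bulkLoadtable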
I'm using the following simple code to upload files to HDFS.
FileSystem hdfs = FileSystem.get(config);
hdfs.copyFromLocalFile(src, dst);
The files are generated by a web server Java component, then rotated and closed by logback in .gz format. I've noticed that sometimes the .gz file is corrupted:
> gunzip logfile.log_2013_02_20_07.close.gz
gzip: logfile.log_2013_02_20_07.close.gz: unexpected end of file
But the following command does show me the content of the file:
> hadoop fs -text /input/2013/02/20/logfile.log_2013_02_20_07.close.gz
The impact of having such files is quite disastrous: the aggregation for the whole day fails, and several slave nodes get blacklisted in such cases.
What can I do in such a case?
Can Hadoop's copyFromLocalFile() utility corrupt the file?
Has anyone met a similar problem?
It shouldn't; this error is normally associated with gzip files that weren't closed out when originally written to local disk, or that are copied to HDFS before they have finished being written.
You should be able to check by running md5sum on the original file and on the copy in HDFS; if they match, then the original file is corrupt:
hadoop fs -cat /input/2013/02/20/logfile.log_2013_02_20_07.close.gz | md5sum
md5sum /path/to/local/logfile.log_2013_02_20_07.close.gz
If they don't match, then check the timestamps on the two files; the one in HDFS should have been modified after the local one.
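The same check as a small script, using the paths from the question (a sketch only; adjust the paths to your environment):
local_file=/path/to/local/logfile.log_2013_02_20_07.close.gz
hdfs_file=/input/2013/02/20/logfile.log_2013_02_20_07.close.gz
# Compare the checksum of the HDFS copy against the local original.
local_sum=$(md5sum "$local_file" | awk '{print $1}')
hdfs_sum=$(hadoop fs -cat "$hdfs_file" | md5sum | awk '{print $1}')
if [ "$local_sum" = "$hdfs_sum" ]; then
  echo "Checksums match: the local original is already corrupt."
else
  echo "Checksums differ: compare timestamps; the HDFS copy should be newer."
fi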
I'm trying to load a pipe-delimited ('|') file in Pig using the following command:
A = load 'test.csv' using PigStorage('|');
But I keep getting this error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. java.net.URISyntaxException cannot be cast to java.lang.Error
I've looked all over, but I can't find any reason this would happen. The test file I have above is a simple file that just contains 1|2|3 for testing.
If you are running Pig with MAPREDUCE as the ExecType, then the following command should work:
A = LOAD '/user/pig/input/pipetest.csv' USING PigStorage('|');
DUMP A;
Here is the output on your screen:
(1,2,3)
Note that I have included the full HDFS path to my csv file in the LOAD command.
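For completeness, a sketch of the setup assumed above (the HDFS paths are assumptions): copy the local file into HDFS, then start Pig in MapReduce mode and run the LOAD and DUMP from the grunt shell.
# Put the local test file into HDFS under the path used in the LOAD statement.
hdfs dfs -mkdir -p /user/pig/input
hdfs dfs -put test.csv /user/pig/input/pipetest.csv
# Start Pig in MapReduce mode; the LOAD/DUMP statements are then entered at the grunt> prompt.
pig -x mapreduce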