Apache Solr support for ORC file format - hadoop

I have a bunch of tables in Hive, stored as ORC. I want to index their data in a SolrCloud collection.
Is there any support for indexing data stored in ORC format in Solr?
I've googled around but nothing turned up.

It looks like you want Solr to read data from a specific Hive file format.
You might look at the problem the other way around, i.e. use Hive to write data to Solr, and thus let Hive take care of the complexity of the actual input file format (whether ORC, Parquet, Avro, whatever, even HBase data files).
In the LucidWorks GitHub repo you will find a project labeled hive-solr. Have a look.
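Roughly, the flow is: declare an external Hive table backed by Solr through the project's storage handler, then INSERT into it from your ORC table. The sketch below uses Java over the Hive JDBC driver; the table names are made up, and the storage handler class and solr.* properties are recalled from the hive-solr README, so verify them against the repo before relying on them.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveToSolr {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Standard HiveServer2 JDBC connection; adjust host/port/database for your cluster.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {

                // External table backed by Solr via the LucidWorks storage handler.
                // NOTE: class name and property keys are quoted from memory of the
                // hive-solr README; check the repo for the exact values.
                stmt.execute(
                    "CREATE EXTERNAL TABLE my_table_solr (id string, col1 string, col2 int) " +
                    "STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' " +
                    "LOCATION '/tmp/solr' " +
                    "TBLPROPERTIES('solr.zkhost' = 'zk1:2181,zk2:2181/solr', " +
                    "              'solr.collection' = 'my_collection', " +
                    "              'solr.query' = '*:*')");

                // Pushing the ORC table's rows into Solr is then a plain INSERT,
                // and Hive handles reading the ORC files for you.
                stmt.execute("INSERT INTO TABLE my_table_solr " +
                             "SELECT id, col1, col2 FROM my_orc_table");
            }
        }
    }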

I'll accept Samson's answer.
Anyway, I'm not fully satisfied with this solution. In fact, I still need to create an external table, manually declaring all the fields of the original table. In terms of operations, it is not that different from creating a new table (stored as textfile) from the original one, indexing the new text files and finally dropping them (of course, this may be a problem for very large tables, which is not my case).
Since ORC is a self-describing format, it would be great if Solr could read both field names and data directly from the compressed files.

Related

1 billion records join (filters) in Spark with Parquet file format vs Hadoop text input format

I am reading a table with 1 billion records in Spark from Hive, and the table is partitioned by date and country columns. The job runs for a very long time since we are doing many transformations on it. If I change the Hive table file format to Parquet, will there be any performance gain? Any suggestions for improving performance?
Changing from ORC to Parquet may not improve the performance by itself.
It depends on the type of data you have. If you are working with nested objects you should use Parquet; ORC is not as good for that.
To get some improvement, though, I suggest a few steps that can help with your data in Hive.
Check the number of files in Hive.
One common thing that creates big problems for Hive queries is the number of files in each partition and how big those files are. If you are using Spark to store the data, I suggest you check the size of the files and whether they match your Hadoop block size. If not, try to use the CONCATENATE command to solve that problem.
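For an ORC table, CONCATENATE merges the small files of a partition into fewer, larger ones. A minimal sketch over the Hive JDBC driver (table and partition names are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ConcatenateSmallFiles {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Merge the small ORC files of one partition into fewer, larger files.
                // CONCATENATE only works for ORC (and RCFile) tables.
                stmt.execute("ALTER TABLE events " +
                             "PARTITION (dt='2016-01-01', country='US') CONCATENATE");
            }
        }
    }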
Predicate Pushdown
This is where Hive with ORC files can give you the best query performance. I suggest running an ANALYZE command to force the creation of statistics for your table; this updates the Hive metastore with relevant information about the data and helps the optimizer (a combined sketch of this and the next suggestion follows at the end of this answer).
Ordered Data
If possible, try to store your data ordered by some column, and do your filtering and other operations on that column. Your joins can be improved by this.
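A combined sketch of these last two suggestions: compute statistics, then rewrite the data sorted by a column you filter or join on, so ORC's min/max indexes can skip stripes during predicate pushdown. Again over Hive JDBC, with hypothetical table and column names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class StatsAndSortedCopy {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Table- and column-level statistics go into the Hive metastore
                // and feed the query optimizer.
                stmt.execute("ANALYZE TABLE events PARTITION (dt, country) COMPUTE STATISTICS");
                stmt.execute("ANALYZE TABLE events PARTITION (dt, country) " +
                             "COMPUTE STATISTICS FOR COLUMNS");

                // Rewrite the data ordered by a frequently filtered/joined column.
                stmt.execute("CREATE TABLE events_sorted STORED AS ORC AS " +
                             "SELECT * FROM events SORT BY customer_id");
            }
        }
    }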

Data storage format for unstructured data rows on HDFS

We are consuming very large amounts of data that need to be written as fast as we receive them, and we are already using HDFS, so we prefer to keep using it. The data is almost unstructured, and we will only rarely run basic queries on it. The data is flat with some fields, each row representing a separate record:
key1=str key2=30.3 key3=longtexthere
Another data row:
key1=3 key5=abc
SequenceFile seemed the most natural choice, but I could not find how to store multiple rows in a single SequenceFile.
Currently, in our temporary solution, we have multiple writers that write to multiple text files. When querying is needed, we read them in parallel. However, the current text files contain thousands of rows, and I don't think creating a single SequenceFile for each row would be feasible; it would incur a lot of overhead for storing metadata and for opening far too many files at once when querying.
I think the problem could be solved by using HBase or Cassandra, a columnar database, but we are more or less required to use HDFS. Am I missing something with SequenceFiles, or should we really use a columnar database?
So the SequenceFile format is like this:
<key, value>
<key, value>
<key, value>
...
where the key is a WritableComparable and the value is a Writable.
Now what a lot of people are doing - and you could do the same - is:
Only use the key OR the value 'column'
Implement a custom Writable which wraps a set of other Writables (call it record, row, ...)
That way you can model everything you want. That record Writable could have a fixed schema, like 'IntWritable, Text, IntWritable, IntWritable' (depending on your fields). Or, in case you don't want to support different types, you could use the existing ArrayWritable as your 'record'.
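As a concrete illustration of the "only use the value column, wrap the fields in one Writable" idea, here is a minimal sketch that stores each key=value row as a MapWritable (a ready-made map of Writables that ships with Hadoop), so rows with different field sets can live in the same file. The file path is made up, and the field values are kept as Text for simplicity:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriteRows {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/data/rows.seq");

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(MapWritable.class))) {

                // Row 1: key1=str key2=30.3 key3=longtexthere
                MapWritable row1 = new MapWritable();
                row1.put(new Text("key1"), new Text("str"));
                row1.put(new Text("key2"), new Text("30.3"));
                row1.put(new Text("key3"), new Text("longtexthere"));
                writer.append(new LongWritable(1), row1);

                // Row 2: key1=3 key5=abc (a different field set is fine)
                MapWritable row2 = new MapWritable();
                row2.put(new Text("key1"), new Text("3"));
                row2.put(new Text("key5"), new Text("abc"));
                writer.append(new LongWritable(2), row2);
            }
        }
    }

A custom Writable with typed fields (IntWritable, Text, ...) would be more compact than a map of Text values, at the cost of writing the serialization yourself.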
Knowing the schema of each file (e.g. by putting it into the metadata of the sequence file) will allow you to read files with different/evolved schemas.
So it is a lot of handcrafting, but you can build a very efficient and flexible structure. I have never used it, but take a look at http://pangool.net/userguide/schemas.html; I think they already modeled such a flexible record/tuple schema on top of sequence files.
Bottom line, I think you can achieve what you want with sequence files.
However, I would recommend also having a look at columnar file formats like Parquet or ORC. They come with their own tradeoffs, but you get higher compression rates and selective reads (column projection, filter pushdown). You also don't have to invent the schema/tuple structure yourself.

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem and I need some suggestions from big data experts on achieving schema verification/validation before loading huge data into HDFS.
The scenario is:
I have a huge dataset with a given schema (around 200 column headers in it). This dataset is going to be stored in Hive tables/HDFS. Before loading the data into the Hive table/HDFS I want to perform schema-level verification/validation on the supplied data, to avoid any unwanted errors/exceptions while loading the data into HDFS. For example, if somebody tries to pass a data file having fewer or more columns in it, the load should fail at the first level of verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them to HDFS and run MapReduce on top of them. There you have a hold on each row, so you can verify the number of columns, their types, and any other validations.
As for JSON/XML, there is a slight overhead in making MapReduce identify the records in that format. However, with respect to validation, there is schema validation you can enforce, and you can also restrict a field to specific values using the schema. Once the schema is ready, your parsing (XML to Java) validates the data, and you then store it at another final HDFS location for further use (like HBase). When you are sure that the data is validated, you can create Hive tables on top of it.
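For the flat/delimited case, a map-only validation pass could look like the sketch below: the mapper checks each row's column count against the expected schema, counts good and bad rows in counters, and emits the offending rows for inspection. The expected column count and delimiter are assumptions you would replace with your own:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ColumnCountValidator
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private static final int EXPECTED_COLUMNS = 200;  // assumption: ~200 column headers
        private static final String DELIMITER = ",";      // assumption: comma-delimited input

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // The -1 limit keeps trailing empty fields, so short rows are not silently accepted.
            String[] fields = line.toString().split(DELIMITER, -1);
            if (fields.length != EXPECTED_COLUMNS) {
                // Emit the offending row so it can be inspected, and count it as invalid.
                context.getCounter("validation", "invalid_rows").increment(1);
                context.write(line, NullWritable.get());
            } else {
                context.getCounter("validation", "valid_rows").increment(1);
            }
        }
    }

If the invalid_rows counter is non-zero after the job, you can fail the load before the data ever reaches Hive.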
Use the utility below to create temporary tables each time, based on the schema of the CSV file you receive in the staging directory, and then apply some conditions to identify whether you have valid columns or not. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive

Modeling Data in Hadoop

Currently I am bringing around 10 tables into Hadoop from an EDW (Enterprise Data Warehouse); these tables are closely related to a star schema model. I'm using Sqoop to bring all these tables across, resulting in 10 directories containing csv files.
I'm looking at better ways to store these files before kicking off MR jobs. Should I follow some kind of model or build an aggregate before working on MR jobs? I'm basically looking at ways of storing related data together.
Most things I have found by searching are storing trivial csv files and reading them with opencsv. I'm looking for something a bit more involved and not just for csv files. If moving towards another format works better than csv, then that is no problem.
Boils down to: How best to store a bunch of related data in HDFS to have a good experience with MR.
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data+schema in the same file but is compact and efficient for fast serialization. It gives you versioning facilities which are useful when bringing in updated data with a different schema. Hive supports it in both reading and writing and Map Reduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
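To make that concrete, here is a small sketch of defining an Avro schema and writing a record with the plain Avro Java API; the record and field names are made up for illustration. A Sqoop import with the --as-avrodatafile option produces files of the same kind, with the schema derived from the source table:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class WriteAvro {
        public static void main(String[] args) throws Exception {
            // Schema for a hypothetical dimension table pulled from the EDW.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":[" +
                "{\"name\":\"id\",\"type\":\"long\"}," +
                "{\"name\":\"name\",\"type\":\"string\"}," +
                "{\"name\":\"country\",\"type\":\"string\"}]}");

            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 42L);
            record.put("name", "Acme Corp");
            record.put("country", "US");

            // The schema travels inside the file, so readers need no side channel.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("customer.avro"));
                writer.append(record);
            }
        }
    }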
Storing these files as CSV is fine, since you will be able to process them with a plain text input format and also read them through Hive using a specific delimiter. You could change the delimiter from comma to pipe ("|") if you prefer; that's what I do most of the time. You generally need large files in Hadoop, but if the data is large enough that each partition is on the order of a few hundred gigabytes, it would be good to split these files into separate directories based on your partition column.
It would also be a better idea to have most of the columns in a single table than to have many small normalized tables, though that varies depending on your data size. Also make sure that whenever you copy, move, or create data you do all the constraint checks in your applications, as it will be difficult to make small changes to the table later on; you will need to rewrite the complete file for even a small change.
Hive partitioning and bucketing concepts can be used effectively to put similar data together (not across nodes, but in files and folders) based on a particular column. There are some nice tutorials available on partitioning and bucketing.
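As a rough sketch of what that looks like in DDL, executed here over the Hive JDBC driver; the table, columns, and bucket count are illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PartitionedBucketedTable {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Partition by a coarse column (one folder per value) and bucket
                // by a join key (fixed number of files per partition) so related
                // rows end up physically close together.
                stmt.execute(
                    "CREATE TABLE sales (order_id bigint, customer_id bigint, amount double) " +
                    "PARTITIONED BY (order_date string) " +
                    "CLUSTERED BY (customer_id) INTO 32 BUCKETS " +
                    "STORED AS ORC");
            }
        }
    }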

Can HBase Access Text Documents and CSV Documents Just as Hadoop?

In Hadoop, I can easily create Map/Reduce apps which access and process data in huge text files and CSV files. My question is: can HBase do the same and access such huge files, or does HBase have other uses?
HBase runs queries much like relational databases, so I have a hard time understanding the advantage of HBase, unless it can access huge text and CSV files just as Hadoop does.
First of all, HBase is just a store, and a store never accesses anything; rather, you access the store to fetch or put data. Like any other datastore, HBase has only one job to do: store your data and make it available to you whenever you need it. You can write MapReduce jobs or standalone Java programs to put data into HBase or fetch data from it. It's totally up to you which path you prefer.
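For example, a standalone Java program that puts a row into HBase and reads it back looks roughly like this; the table, column family, and qualifier names are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGet {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath for cluster settings.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("docs"))) {

                // Store one row: any bytes will do, e.g. a line of text or a CSV record.
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"),
                              Bytes.toBytes("any text, csv, or binary content"));
                table.put(put);

                // Random read of that single row by key.
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                byte[] body = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("body"));
                System.out.println(Bytes.toString(body));
            }
        }
    }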
Coming to the second part of your question, HBase does not work like traditional relational databases. Everything, from storing the data to accessing it, is totally different. The advantage of using HBase is that you can store a really huge amount of data in it and have random read/write access. The data can be of any type: text, CSV, TSV, binary, and so on. But before going ahead, you must think carefully about whether HBase is a suitable choice for you, as one size doesn't fit all.
HTH
