Parse Freebase RDF dump with MapReduce - hadoop

I downloaded the RDF data dump from Freebase, and what I need to extract is the English name of every entity in Freebase.
Do I have to use Hadoop and MapReduce to do this, and if so, how? Or is there another way to extract the entity names?
It would be nice if each entity title / name were on its own line in a .txt file.

You could use Hadoop, but for such simple processing, you'd spend more time uncompressing and splitting the input than you would save in being able to do the search in parallel. A simple zgrep would accomplish your task in much less time.
Something along the lines of this:
zegrep $'name.*@en\t\\.$' freebase-public/rdf/freebase-rdf-2013-09-15-00-00.gz | cut -f 1,3 | gzip > freebase-names-20130915.txt.gz
will give you a compressed two-column file of Freebase MIDs and their English names. You'll probably want to make the grep a little more specific to avoid false positives (and test it, which I haven't done). The dump file is over 20GB compressed, so it'll take a while, but less time than it would take just to set up a Hadoop job.
If you want to do additional filtering, such as extracting only entities of type /common/topic, you may find that you need to move to a scripting language like Python so that you can look at and evaluate multiple lines at once.
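As a rough illustration of that multi-line filtering, here is a minimal Python sketch. It assumes that each line is a tab-separated triple (subject, predicate, object, "."), that all triples for one subject are adjacent in the dump, and that the predicates contain type.object.name and type.object.type; none of this is verified against a real dump, so check a few lines of yours first. The function name and the common.topic spelling of /common/topic are also assumptions.

```python
import gzip
from typing import Iterable, Iterator, Tuple

# Assumed predicate fragments; verify them against a few lines of your dump.
NAME_PREDICATE = "type.object.name"
TYPE_PREDICATE = "type.object.type"


def english_topic_names(
    lines: Iterable[str], wanted_type: str = "common.topic"
) -> Iterator[Tuple[str, str]]:
    """Yield (subject, name) for subjects that have the wanted type and an English name.

    Assumes tab-separated triples and that all triples of a subject are adjacent.
    """
    current, names, has_type = None, [], False
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip malformed lines
        subject, predicate, obj = parts[0], parts[1], parts[2]
        if subject != current:
            # A new subject starts: emit what was collected for the previous one.
            if has_type:
                for name in names:
                    yield current, name
            current, names, has_type = subject, [], False
        if NAME_PREDICATE in predicate and obj.endswith("@en"):
            names.append(obj[:-3].strip('"'))
        elif TYPE_PREDICATE in predicate and wanted_type in obj:
            has_type = True
    if has_type:
        for name in names:
            yield current, name


def write_names(dump_path: str, out_path: str) -> None:
    """Stream the gzipped dump and write one English name per line."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for _, name in english_topic_names(src):
            dst.write(name + "\n")
```

Because the dump is streamed line by line, memory use stays small regardless of the file size.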

No, I don't think you need to use Hadoop and MapReduce to do this. You can easily create a web service to extract the RDF and send it to a file. The blog post at [1] explains how you can extract RDF data using WSO2 Data Services Server. Similarly, you can use the WSO2 DSS data federation feature [2] to extract RDF data and send it to an Excel spreadsheet.
[1] - http://sparkletechthoughts.blogspot.com/2011/09/extracting-rdf-data-using-wso2-data.html
[2] - http://prabathabey.blogspot.com/2011/08/data-federation-with-wso2-data-service.html

There's a screencast for Google Compute Engine that shows you how to do this as well.

Related

Apache Nifi MergeContent output data inconsistent?

Fairly new to using nifi. Need help with the design.
I am trying to create a simple flow that takes dummy CSV files (for now) from an HDFS directory and prepends some text data to each record in each flowfile.
Incoming files:
dummy1.csv
dummy2.csv
dummy3.csv
contents:
"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...
Desired output:
d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61
The flow: SplitText -> ReplaceText -> MergeContent
(This may be a poor way to achieve what I am trying to get, but I saw somewhere that a UUID is the best bet for generating a unique session id, so I thought of extracting each line of the incoming data into its own flowfile and generating a UUID for it.)
But somehow, as you can see, the order of the data gets messed up: the first 3 rows of the output are not the same as in the input. The test data I am using (50000 entries) does seem to contain those rows, but on other lines. Multiple tests show that the order usually changes after the 2001st line.
And yes, I did search for similar issues here and tried using the Defragment strategy in MergeContent, but it didn't work. I would appreciate it if someone could explain what is happening here and how I can get the data in the original order with a unique session_id and timestamp for each record. Is there some parameter I need to change to get the correct output? I am open to suggestions if there is a better way as well.
First of all thank you for such an elaborate and detailed response. I think you cleared a lot of doubts I had as to how the processor works!
The ordering of the merge is only guaranteed in defragment mode because it will put the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that showed the problem it would be helpful to debug.
I will try to replicate this method using a clean template again. It could be a parameter problem, or the HDFS writer not being able to write.
I'm not sure if the intent of your flow is to just re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, after splitting and re-merging, you should again have 10 CSVs.
Yes, that is exactly what I need: split the data and join it back into the corresponding files. I don't specifically (yet) need to join the outputs again.
The approach of splitting a CSV down to 1 line per flow file to manipulate each line is a common approach, however it won't perform very well if you have many large CSV files. A more efficient approach would be to try and manipulate the data in place without splitting. This can generally be done with the record-oriented processors.
I used this approach purely instinctively and did not realize it is a common method. Sometimes the data file could be very large, meaning more than a million records in a single file. Won't that be an issue for I/O in the cluster, since each record = one flowfile = one unique UUID? What is a comfortable number of flowfiles for NiFi to handle? (I know it depends on the cluster config, and I will try to get more info about the cluster from the HDP admin.)
What do you mean by "try and manipulate the data in place without splitting"? Can you give an example, a template, or a processor to use?
In this case you would need to define a schema for your CSV which included all the columns in your data, plus the session id and timestamp. Then using an UpdateRecord processor you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This would stream the content line by line and update each record and write it back out, keeping it all as one flow file.
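To illustrate the idea of updating records in place without splitting, here is a small plain-Python analogue. It is not NiFi code and not the UpdateRecord processor; the function name is hypothetical. It only shows the same streaming pattern: read one record, add the two fields, write it out, and keep everything in one file so the original order is preserved.

```python
import csv
import io
import uuid
from datetime import datetime, timezone
from typing import TextIO


def add_session_columns(src: TextIO, dst: TextIO) -> int:
    """Copy CSV rows from src to dst, prepending a fresh UUID and a timestamp.

    Rows are streamed one at a time, so order is preserved and memory use is flat.
    Returns the number of rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    # One timestamp per file, mirroring the sample output in the question.
    stamp = datetime.now(timezone.utc).isoformat()
    count = 0
    for row in reader:
        writer.writerow([str(uuid.uuid4()), stamp] + row)
        count += 1
    return count
```

The csv module handles quoted fields such as "1.7 Cubic Foot Compact ""Cube"" Office Refrigerators", which a naive split on commas would break.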
This looks promising. Can you share a simple template that pulls files from HDFS, processes them, and writes them back to HDFS, but without splitting?
I am reluctant to share the template due to restrictions. But let me see if I can create a generic template, and I will share it.
Thank you for your wisdom! :)

Save and Process huge amount of small files with spark

I'm new to big data! I have some questions about how to store and process a large number of small files (pdf and ppt/pptx) with Spark on EMR clusters.
My goal is to save the data (pdf and pptx) in HDFS (or in some other datastore on the cluster), then extract the content of these files with Spark and save it in Elasticsearch or a relational database.
I have read about the small-files problem when saving data in HDFS. What is the best way to save a large number of pdf and pptx files (maximum size 100-120 MB)? I have read about Sequence Files and HAR (Hadoop archives), but I don't understand exactly how either of them works, and I can't figure out which is best.
What is the best way to process these files? I understand that some solutions could be FileInputFormat or CombineFileInputFormat, but again I don't know exactly how they work. I know that I can't run every small file in a separate task, because that would bottleneck the cluster.
Thanks!
1) If you use object stores (like S3) instead of HDFS, then there is no need to apply any changes or conversions to your files, and you can store each of them as a single object or blob (this also means they are easily readable using standard tools and needn't be unpacked or reformatted with custom classes or code).
You can then read the files using Python tools like boto (for S3) or, if you are working with Spark, using the wholeTextFiles or binaryFiles methods and then making a BytesIO (Python) / ByteArrayInputStream (Java) to read them using standard libraries.
2) When processing the files, you have the distinction between items and partitions. If you have 10,000 files, you can create 100 partitions containing 100 files each. Each file will need to be processed one at a time anyway, since the header information is relevant and likely different for each file.
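A minimal PySpark sketch of that items-versus-partitions idea follows. The per-file function only peeks at the first four bytes as a stand-in for real PDF/PPTX parsing, and the function names and the input glob are hypothetical. The Spark part needs pyspark installed and is imported lazily so the helper can be tried on its own; minPartitions is a hint to Spark, not a guarantee.

```python
import io
from typing import List, Tuple


def describe_file(item: Tuple[str, bytes]) -> Tuple[str, int, bytes]:
    """Per-file work on one (path, content) pair as produced by binaryFiles.

    A real job would hand the BytesIO stream to a PDF or PPTX parsing library;
    here we only read the first 4 bytes as a placeholder.
    """
    path, content = item
    stream = io.BytesIO(content)
    magic = stream.read(4)
    return path, len(content), magic


def run_on_spark(input_glob: str, num_partitions: int) -> List[Tuple[str, int, bytes]]:
    """Hypothetical driver: many files per partition, each file handled one at a time."""
    from pyspark import SparkContext  # lazy import; requires pyspark

    sc = SparkContext.getOrCreate()
    files = sc.binaryFiles(input_glob, minPartitions=num_partitions)
    return files.map(describe_file).collect()
```

With 10,000 input files and num_partitions=100, each task would handle roughly 100 files rather than one task per file.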
Meanwhile, I found some solutions for that small files problem in HDFS. I can use the following approaches:
HDFS Federation helps us distribute the load across namenodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could also be a good alternative if your files are not too large.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone is object storage like S3, but on-premises. At the time of writing, from what I know, Ozone is not production ready. https://hadoop.apache.org/ozone/

Gathering heterogeneous data with hadoop

We have a system, including some Oracle and Microsoft SQL Server databases, that gets data from several different sources in different formats, stores it, and processes it. "Different formats" means files: dbf, xls and others, including binary formats (images), which are imported into the DBMS with different tools, as well as direct access to the databases. I want to isolate all the incoming data, store it "forever", and be able to retrieve it later by source and creation time. After some study I want to try the Hadoop ecosystem, but I am not quite sure whether it's an adequate solution for this goal. And what parts of the ecosystem should I use? HDFS alone, Hive, maybe something else? Could you give me a piece of advice?
I assume you want to store the files that contain the data -- effectively a searchable file archive.
The files themselves can just be stored in HDFS ... or you may find a system like Amazon's S3 cheaper and more flexible. As you store the files, you could manage the other data about the data, namely: location, source, and creation time by appending to another file -- a simple tab-separated file or several other formats supported by Hadoop make this easy.
You can manage and query the file with Hive or other SQL-on-Hadoop tools. In effect, you're creating a simple file system with special attributes, so the trick would be to make sure that each time you write a file, you also write the metadata. You may have to handle cases like write failures, what happens when you delete, rename, or move files (I know, you say "never").
Your solution might be even simpler, depending on your needs: you may find that storing the data in subdirectories within HDFS (or AWS S3) is enough. For example, if you wanted to store DBF files from source "foo" and XLS files from source "bar", created on December 1, 2015, you could simply create a directory structure like
/2015/12/01/foo/dbf/myfile.dbf
/2015/12/01/bar/xls/myexcel.xls
This solution has the advantage of being self-maintaining -- the file path stores the metadata which makes it very portable and simple, requiring nothing more than a shell script to implement.
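A small sketch of that path convention in Python, assuming the layout /YYYY/MM/DD/source/format/filename shown above. The function names are hypothetical, and taking the format from the file extension is an assumption.

```python
from datetime import date
from pathlib import PurePosixPath


def archive_path(created: date, source: str, filename: str) -> str:
    """Build /YYYY/MM/DD/<source>/<format>/<filename> from the file's metadata."""
    # Assumption: the format directory is the lower-cased file extension.
    fmt = PurePosixPath(filename).suffix.lstrip(".").lower() or "unknown"
    path = PurePosixPath(
        "/", f"{created:%Y}", f"{created:%m}", f"{created:%d}", source, fmt, filename
    )
    return str(path)


def parse_archive_path(path: str) -> dict:
    """Recover the metadata that the path encodes."""
    parts = PurePosixPath(path).parts  # first element is the root "/"
    year, month, day, source, fmt, name = parts[1:7]
    return {
        "created": date(int(year), int(month), int(day)),
        "source": source,
        "format": fmt,
        "filename": name,
    }
```

Since the path carries the source and creation date, listing a directory such as /2015/12/01/foo answers "what came from foo on that day" without a separate metadata file.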
I don't think there's any reason to make the solution more complicated than necessary. Hadoop or S3 are both fine for long-term, high-durability storage and for querying. My company has found that storing the information about the file in Hadoop (which we use for many other purposes) and storing the files themselves on AWS S3 is far simpler, more easily secured and much cheaper.
There are various things that you may want to do, each with their own solution. If more than 1 use case is relevant for you, you probably want to implement multiple solutions in parallel.
1. Store files for use
If you want to store files in a way that they can be picked up efficiently (distributed), the solution is simple: put the files on HDFS.
2. Store the information for use
If you want to use the information rather than just store the files, you should be interested in storing the information in a way that it can be picked up efficiently. The general solution here would be: parse the files in a lossless way and store their information in a database.
You may find that storing the information in (partitioned) ORC files works well for this. You can do this with Hive, Pig, or even UDFs (e.g. Python) in Pig.
3. Keep the files for the future
In this case you would mostly care about preserving the files, and not so much about ease of access. Here the recommended solution is: Store compressed files with proper backups
Note that the replication HDFS performs exists to handle data more efficiently (and to cope with hardware issues). Just having your data on HDFS does NOT mean that it is backed up.

using wikipedia dataset for pagerank in hadoop

I will be doing a project on PageRank and inverted indexing of the Wikipedia dataset using Apache Hadoop. I downloaded the whole wiki dump - http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 . It decompresses to a single 42 GB .xml file. I want to process this file to get data suitable as input for PageRank and inverted-indexing MapReduce algorithms. Please help! Any leads will be helpful.
You need to write your own InputFormat to process XML. You would also need to implement a RecordReader to make sure your input splits contain fully formed XML chunks and not just single lines. See http://www.undercloud.org/?p=408 .
Your question is not very clear to me. What kind of idea do you need?
The very first thing that is going to hit you is how to process this XML file in your MR job. The MR framework doesn't provide any built-in InputFormat for XML files. For this you might want to have a look at this.
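The record-boundary idea behind such a custom RecordReader can be sketched in plain Python. This is not a Hadoop InputFormat, only an illustration of grouping lines into complete records. It assumes that the opening and closing page tags each sit on their own line in the dump; the function name is hypothetical.

```python
from typing import Iterable, Iterator


def iter_pages(
    lines: Iterable[str], start_tag: str = "<page>", end_tag: str = "</page>"
) -> Iterator[str]:
    """Yield one complete <page>...</page> block at a time from a stream of lines.

    Lines outside a page (dump header, siteinfo, closing root tag) are skipped.
    """
    buffer = []
    inside = False
    for line in lines:
        if not inside and start_tag in line:
            inside = True
            buffer = []
        if inside:
            buffer.append(line)
            if end_tag in line:
                yield "".join(buffer)
                inside = False
```

Passing a file object from bz2.open(path, "rt", encoding="utf-8") as lines would stream the compressed dump without first decompressing it to disk; each yielded block is a well-formed fragment that can then be parsed for its title, text, and links.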

hadoop/HDFS: Is it possible to write from several processes to the same file?

For example, create a 20-byte file:
the 1st process will write bytes 0 to 4,
the 2nd bytes 5 to 9,
etc.
I need this to create big files in parallel using my MapReduce job.
Thanks.
P.S. Maybe it is not implemented yet, but it is possible in general - point me where I should dig please.
Are you able to explain what you plan to do with this file after you have created it?
If you need to get it out of HDFS in order to use it, you can let Hadoop M/R create separate files and then use a command like hadoop fs -cat /path/to/output/part* > localfile to combine the parts into a single file saved on the local file system.
Otherwise, there is no way you can have multiple writers open to the same file - reading and writing to HDFS is stream based, and while you can have multiple readers open (possibly reading different blocks), multiple writing is not possible.
Web downloaders request parts of a file in multiple threads using the Range HTTP header, and then either write the parts to tmp files and merge them together later (as Thomas Jungblut suggests), or make use of random I/O, buffering the downloaded parts in memory before writing them to the correct location in the output file. Unfortunately, you don't have the ability to perform random output with Hadoop HDFS.
I think the short answer is no. The way you accomplish this is write your multiple 'preliminary' files to hadoop and then M/R them into a single consolidated file. Basically, use hadoop, don't reinvent the wheel.
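The "separate part files, then merge" pattern can be sketched locally in Python. This uses the local file system rather than HDFS; the part-00000 naming only mimics typical job output, and both function names are hypothetical.

```python
import shutil
from pathlib import Path


def write_part(out_dir: Path, index: int, data: bytes) -> Path:
    """Each writer owns exactly one file, so no two writers share a file."""
    part = out_dir / f"part-{index:05d}"
    part.write_bytes(data)
    return part


def merge_parts(out_dir: Path, target: Path) -> None:
    """Concatenate the part files in index order into a single target file."""
    with target.open("wb") as dst:
        # Zero-padded names make lexical order equal to numeric order.
        for part in sorted(out_dir.glob("part-*")):
            with part.open("rb") as src:
                shutil.copyfileobj(src, dst)
```

The writers can finish in any order; only the merge step needs to read the parts in sequence.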
