I have many files; for each file I calculate an MD5 hash line by line and save all the hashes in a database. Now, given a new file, I calculate its MD5 hashes line by line as well. How can I find whether any stored file matches the new file with a defined percent similarity, e.g. 90%, and retrieve the matched file? What data structure should I use for space and time efficiency?
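If each file is represented as the set of its per-line MD5 hashes, "percent similarity" can be measured as Jaccard similarity between the two sets; for large collections you would layer MinHash/LSH on top to avoid comparing against every stored file. A minimal plain-Java sketch of the exact set-based comparison (the class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class LineSimilarity {
    // Hash one line to a hex MD5 digest.
    static String md5(String line) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(line.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Jaccard similarity between two sets of line hashes: |A ∩ B| / |A ∪ B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        int inter = 0;
        for (String h : a) if (b.contains(h)) inter++;
        return (double) inter / (a.size() + b.size() - inter);
    }

    public static void main(String[] args) throws Exception {
        Set<String> f1 = new HashSet<>();
        Set<String> f2 = new HashSet<>();
        for (String line : new String[]{"a", "b", "c", "d"}) f1.add(md5(line));
        for (String line : new String[]{"a", "b", "c", "x"}) f2.add(md5(line));
        System.out.println(jaccard(f1, f2)); // 3 shared of 5 distinct lines = 0.6
    }
}
```

With MinHash you would store a small fixed-size signature per file instead of the full hash set, which keeps both the database and the candidate search compact.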
Related
I'm trying to convert a fixed-width text file to a pipe-delimited text file. I'm using NiFi's ReplaceText processor to do this. These are my processor configurations:
Replacement Strategy-Regex Replace
Evaluation Mode-Line-by-Line
Line-by-Line Evaluation Mode-All
Search Value- (.{1})(.{4})(.{16})(.{11})(.{14})(.{1})(.{8})(.{16})(.{9})(.{19})(.{5})(.{14})(.{4})(.{33})
Replacement Value- ${'$1':trim()}${literal('|'):unescapeXml()}${'$3':trim()}${literal('|'):unescapeXml()}${'$4':trim()}${literal('|'):unescapeXml()}${'$5':toDecimal()}${literal('|'):unescapeXml()}${'$8':trim()}${literal('|'):unescapeXml()}${'$9':trim():toNumber()}${literal('|'):unescapeXml()}${'$10':trim()}${literal('|'):unescapeXml()}${'$11':toNumber()}${literal('|'):unescapeXml()}${'$12':toDecimal()}${literal('|'):unescapeXml()}${'$13':trim()}${literal('|'):unescapeXml()}${header:substring(63,69)}
I'm trying to split each record according to the column lengths provided to me, trim spaces, and parse values into different types. In this process I observe that some columns in the output file are randomly empty strings even though the records in the fixed-width file contain data. I can't figure out why the expression evaluation is inserting zero-length strings at random. When I try with a small set of records (about 100) from the original file, it works fine. My original file has 12 million records in it.
We have files with a specific format in HDFS. We want to process data extracted from these files within Spark. We have started writing an input format in order to create the RDD; this way we hope to be able to create an RDD from the whole file.
But each processing run only needs a small subset of the data contained in the file, and I know how to extract this subset very efficiently, much more efficiently than filtering a huge RDD.
How can I pass a query filter in the form of a String from my driver to my input format (the same way Hive context does)?
Edit:
My file format is NetCDF, which stores huge matrices efficiently for multidimensional data, for example x, y, z and time. A first approach would be to extract all values from the matrix and produce an RDD line for each value. I'd like my input format to extract only a small subset of the matrix (maybe 0.01%) and build a small RDD to work with. The subset could be z = 0 and a small time period. I need to pass the time period to the input format, which will retrieve only the values I'm interested in.
I guess Hive context does this when you pass an SQL query to the context. Only values matching the SQL query are present in the RDD, not all lines of the files.
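The usual channel for shipping such a String is the Hadoop `Configuration` object that the driver passes to `newAPIHadoopFile`; the custom input format can read it back through the task context. A minimal sketch, assuming a custom `NetCdfInputFormat` and an illustrative key name `netcdf.time.filter` (both are assumptions, not real library names):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Driver side: stash the filter in the Configuration handed to Spark, e.g.
//   Configuration conf = new Configuration();
//   conf.set("netcdf.time.filter", "2015-01-01/2015-01-07");
//   sc.newAPIHadoopFile(path, NetCdfInputFormat.class, K.class, V.class, conf);

public abstract class NetCdfInputFormat<K, V> extends FileInputFormat<K, V> {
    @Override
    public RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        // Input-format side: read the same key back and restrict what is emitted.
        String filter = ctx.getConfiguration().get("netcdf.time.filter");
        // ... build a RecordReader that only yields matrix values inside `filter` ...
        return null; // placeholder in this sketch
    }
}
```

This mirrors what Hive-style predicate pushdown does conceptually: the filter travels with the job configuration, and the record reader never materializes values outside it.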
I have a bunch of text files which are categorized and I would like to create a sequence file for each category in which the key is the category name and the value consists of all the textual content of all the files for the category.
I have a NoSQL database which has only two columns. Each row represents a file: the first column is the category name and the second one is the absolute address of the text file stored on HDFS. My mapper reads the database and outputs pairs in which the key is the category and the value is the absolute address. On the reducer side, I have the addresses of all the files for each category, and I would like to create one sequence file per category in which the key is the category name and the value consists of all the textual content of all the files belonging to that category.
A simple solution is to iterate through the pairs (in the reducer), open the files one by one, and append their content to a String variable, creating a sequence file at the end using MultipleOutputs. However, as the file sizes may be large, appending the content to a single String may not be possible. Is there any way to do this without using a String variable?
Since you have all the file addresses in the reducer, you can read the content of those files and append it using a StringBuilder to save memory, then discard the StringBuilder reference afterwards. If avoiding String is your concern, StringBuilder is a quick alternative. The I/O operations involved in accessing and reading the files are resource intensive; the data itself, however, should be fine given the architecture of reducers in Hadoop.
You can also think of using a combiner. However, that is mainly used to reduce the traffic between map and reduce. You could prepare part of the sequence file at the combiner and the remainder at the reducer level. Of course, this is only valid if the content can be appended as it arrives, and not in a specific order.
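Rather than accumulating everything in a StringBuilder, the content can also be streamed: copy each file's bytes through a fixed-size buffer straight into the output, so memory use stays constant regardless of total size (in the real reducer, the destination would be the sequence-file writer instead of the stream used here). A plain-Java sketch of the idea, with illustrative names:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class StreamingConcat {
    // Append the content of many files to one output stream without ever
    // materializing the combined text as a single String in memory.
    static void concatTo(OutputStream out, List<Path> files) throws IOException {
        byte[] buf = new byte[8192];          // fixed-size buffer: constant memory
        for (Path p : files) {
            try (InputStream in = Files.newInputStream(p)) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("cat", ".txt");
        Path b = Files.createTempFile("cat", ".txt");
        Files.write(a, "hello ".getBytes());
        Files.write(b, "world".getBytes());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        concatTo(out, Arrays.asList(a, b));
        System.out.println(out.toString()); // hello world
    }
}
```

In the reducer you would open each HDFS path with `FileSystem.open` and feed the chunks to the writer the same way, so no single String ever holds a whole category's content.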
I need to read Avro records serialized in Avro files in HDFS. To do that, I use AvroKeyInputFormat, so my mapper is able to work with the records it reads as keys.
My question is, how can I control the split size? With the text input format you define the size in bytes. Here I need to define how many records each split will consist of.
I would like to treat every file in my input directory as one big file. Do I have to use CombineFileInputFormat? Is it possible to use it with Avro?
Splits honor logical record boundaries, while the min and max split sizes are specified in bytes; for example, the text input format won't break lines in a text file even though the split boundaries are defined in bytes.
To have each file in its own split, you can either set the minimum split size to Long.MAX_VALUE (the computed split size is max(minSize, min(maxSize, blockSize)), so a huge minimum forces one split per file) or override the isSplitable method in your code and return false.
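The isSplitable route takes only a few lines: subclass the input format and refuse to split. A sketch against the new-API Avro input format (the subclass name is illustrative):

```java
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;

// One whole file per split: subclass the input format and refuse to split.
public class WholeFileAvroInputFormat<T> extends AvroKeyInputFormat<T> {
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // each Avro file becomes exactly one split, hence one mapper
    }
}
```

Point the job at this class instead of AvroKeyInputFormat and every file arrives intact at a single mapper.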
How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this: the first row contains a comma-separated list of column names; the second and remaining rows specify values against those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns then values might be specified only for the first 10 columns.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information like file name to the mapper/reducer program?
Transformations are specific to columns so how do I remember the columns mentioned in the first row and then correlate and transform values in rows?
Split the file into lines, transform (map) each line in parallel, and join (reduce) the resulting lines into one file?
You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will see the first row with the column info; the rest won't.
I would suggest passing the file-specific metadata in a separate file and distributing it as side data. Your mapper or reducer tasks could read the metadata file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. With all of this, I think you have all the context information you are referring to, and you can do file-specific transformation. Even though the transformation logic differs between files, the mapper output needs to have the same format.
If you are using a reducer, you can set the number of reducers to one to force all output to aggregate into one file.
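Once the header travels as side data, the per-line work in the mapper reduces to mapping positional values onto column names, padding missing trailing columns. A minimal plain-Java sketch of that step, outside Hadoop (class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderAwareParser {
    // Map one data row's values onto the column names from the header row,
    // padding missing trailing columns with empty strings.
    static Map<String, String> parseRow(String[] header, String row) {
        String[] values = row.split(",", -1); // -1 keeps trailing empty fields
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            out.put(header[i], i < values.length ? values[i] : "");
        }
        return out;
    }

    public static void main(String[] args) {
        String[] header = "id,name,city".split(",");
        // Row with a missing trailing column: "city" is padded with ""
        System.out.println(parseRow(header, "1,alice"));
    }
}
```

With the row represented as a name-to-value map, each file's specific transformation can be keyed off the file name obtained from the Context, while every mapper still emits the same output format.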