Would someone help me write a Bash script that keeps only the unique lines, where duplicates are identified solely by the value of a single field (the first field)?
If I have data like this:
123456,23423,Smith,John,Jacob,Main St.,,Houston,78003
654321,54524,Smith,Jenny,,Main St.,,Houston,78003
332423,9023432,Gonzales,Michael,,Everyman,,Dallas,73423
123456,324324,Bryant,Kobe,,Special St.,,New York,2311
234324,232411,Willis,Bruce,,Sunset Blvd,,Hollywood,90210
438329,34233,Moore,Mike,,Whatever,,Detroit,92343
654321,43234,Smith,Jimbo,,Main St.,,Houston,78003
I would like to keep only the lines whose first field does not match the first field of any other line
(result would be a file with these contents below, based on above sample)
332423,9023432,Gonzales,Michael,,Everyman,,Dallas,73423
234324,232411,Willis,Bruce,,Sunset Blvd,,Hollywood,90210
438329,34233,Moore,Mike,,Whatever,,Detroit,92343
What would the bash/awk approach be? Thanks in advance.
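To pin down the intended behaviour, here is a minimal sketch of the two-pass idea in Python rather than bash/awk: count how often each first field occurs, then keep only the lines whose first field occurs exactly once. The function name `keep_unique_first_field` and the in-memory `sample` list are assumptions made for illustration; an awk solution would follow the same count-then-filter logic.

```python
from collections import Counter


def keep_unique_first_field(lines, delimiter=","):
    """Return the lines whose first field appears exactly once in the input."""
    lines = list(lines)
    # First pass: count occurrences of each first field.
    counts = Counter(line.split(delimiter, 1)[0] for line in lines)
    # Second pass: keep lines whose first field was seen only once.
    return [line for line in lines if counts[line.split(delimiter, 1)[0]] == 1]


# Sample data taken from the question (hypothetical in-memory copy).
sample = [
    "123456,23423,Smith,John,Jacob,Main St.,,Houston,78003",
    "654321,54524,Smith,Jenny,,Main St.,,Houston,78003",
    "332423,9023432,Gonzales,Michael,,Everyman,,Dallas,73423",
    "123456,324324,Bryant,Kobe,,Special St.,,New York,2311",
    "234324,232411,Willis,Bruce,,Sunset Blvd,,Hollywood,90210",
    "438329,34233,Moore,Mike,,Whatever,,Detroit,92343",
    "654321,43234,Smith,Jimbo,,Main St.,,Houston,78003",
]

result = keep_unique_first_field(sample)
for line in result:
    print(line)
```

With the sample above, this prints the three lines for 332423, 234324 and 438329, in their original order.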
I have a requirement that my mapper should read a big HDFS text file and write it to a sequence file as a "text_file_name text_file_contents" key:value pair in a single line.
My Mapper then sends the path of this sequence file to the Reducer.
Currently what I am doing is:
Read all lines from the text file and keep appending them to a Text() variable (e.g. "contents").
Once the whole text file has been read, write "contents" into the sequence file.
However, I am not sure whether Text() is able to store a big file. Hence I want to do the following:
Read a single line from the text file.
Write it to the sequence file using writer.append(key, value), where "writer" is a SequenceFile.Writer.
Repeat the above until the whole text file is written.
The problem with this approach is that it writes the "key" with every line I write to the sequence file.
So I just want to know:
Can Text() store a file of any size if I keep appending to it?
How can I avoid writing the "key" in writer.append() in all writes but the first?
Can writer.appendRaw() be used? I could not find sufficient documentation on this function.
To answer your questions:
Text() can store up to a maximum of 2GB.
You can avoid writing a key either by writing a NullWritable or by setting key.ignore to false.
But with the second approach you cannot write your key the first time either, so it is better to use NullWritable.
Using the Read from Text File function I am able to easily read the first line of my file. However, I now want it to read the second line. It would be great to just use a for loop or something similar if I could specify the line number somewhere. Is there a way to do so? Thanks!
First, you can read the entire file as lines by right-clicking on the Read From Text File node and selecting "Read Lines". One read will return an array containing one element for each line and you can work with the lines with regular array handling methods. If you want to read each line individually, you can by wiring a 1 into the Count input and looping. Each iteration will return an array with one element (the current line read). You can get/set the offset (in bytes) to specify where in the file you want to read, but that's not necessary if I read your question correctly.
I have a number in Mathematica, a large number. I have even gotten this number in base 16 form, using OutputForm[]. I am basically trying to write out a number to a file in hex format.
Please keep in mind I am using 123456 in these examples instead of my 70,000 digit number.
Whenever I write a file using a simple Put[123456, "file.raw"] command, I get a raw data file with the actual data 3132333435360A with a line ending.
If I use Put[OutputForm[BaseForm[123456, 16]], "file.raw"] command, I get a raw data file with the data in hex format 31653234300A202020202031360A but still not written as raw data.
I would like the hex form of the number dumped as raw data.
I have tried Export, BinaryWrite, and DumpSave, but can't figure it out.
I am just getting a headache, I guess because I can't see past what I need to do.
One thing I did try was doing:
Export["file.raw", 123456];
But the file is not raw enough. What I mean by that is there is header data and extra crap.
Would love to get this working, thanks.
Please let us know what you expect to see in your output file, and what you want to use it for. Do you want something a human can read, or something in a specified format to be used by a computer? Please provide an example.
The two examples using Put[] correctly provide files containing ASCII characters corresponding to the text representations of your inputs, and which are human-readable.
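The difference between a text representation and raw bytes may be easier to see side by side. The following is a small Python sketch (not Mathematica) using the question's example value 123456; the big-endian byte order and the minimal byte length are assumptions, since the question does not say which layout is wanted.

```python
n = 123456

# Text representation: the ASCII characters of the hex digits "1e240".
hex_text = format(n, "x")
text_bytes = hex_text.encode("ascii")

# Raw representation: the number itself stored as bytes.
# Assumption: big-endian order, using the fewest bytes that hold the value.
length = max(1, (n.bit_length() + 7) // 8)
raw_bytes = n.to_bytes(length, "big")

print(text_bytes.hex())  # 3165323430 (same as the Put output, minus the 0A line ending)
print(raw_bytes.hex())   # 01e240
```

A file holding `text_bytes` is human-readable in a text editor, while a file holding `raw_bytes` shows the hex digits only when viewed in a hex editor.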
I think what you're looking for is IntegerString[_,16]:
In[33]:= IntegerString[123456, 16]
Out[33]= "1e240"
str = OpenWrite[];
WriteString[str, IntegerString[123456, 16]];
Close[str];
FilePrint[%]
1e240
(using WriteString instead of Put avoids having the string characters
I need to add text before the last line of a text file in Windows from the command line. Can anyone please suggest a method?
Thanks in advance.
You can do it in 5 simple steps, in any language of your choice or whichever suits your other requirements (such as C, C++ or Java).
1. Read the complete file.
2. Extract the last line by looking for the last newline.
3. Store the last line and erase it from the file.
4. Append the new text.
5. Append the last line which you deleted previously.
Don't forget to close your file.
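The steps above can be sketched in Python as follows. This is only an illustration under a few assumptions: the file is small enough to read into memory, the helper name `insert_before_last_line` is made up for this example, and the whole file is rewritten rather than edited in place.

```python
from pathlib import Path


def insert_before_last_line(path, new_text):
    """Insert new_text as its own line just before the last line of the file."""
    file_path = Path(path)
    # Read the complete file, keeping each line's trailing newline.
    lines = file_path.read_text().splitlines(keepends=True)
    # Make sure the inserted text ends with a newline so it stays on its own line.
    if not new_text.endswith("\n"):
        new_text += "\n"
    # Put the new text in front of the last line (or at the start of an empty file).
    lines.insert(max(len(lines) - 1, 0), new_text)
    # Write everything back; the file is closed automatically.
    file_path.write_text("".join(lines))
```

Saved as a script, this could be run from the Windows command prompt with a Python interpreter, for example by calling `insert_before_last_line("notes.txt", "new line")`, where `notes.txt` is a hypothetical file name.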