Hadoop Mapreduce: TextInputFormat: Meaning of position - hadoop

I am trying to understand the documentation for TextInputFormat, which says: "An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text."
What does "position" mean? Does it mean the line number in the file?
Given data in a file
dobbs 2007 20 18 15
dobbs 2008 22 20 12
doctor 2007 545525 366136 57313
doctor 2008 668666 446034 72694
Would it produce a map input like this?
(1, "dobbs 2007 20 18 15")
(2, "dobbs 2008 22 20 12")
(3, "doctor 2007 545525 366136 57313")
(4, "doctor 2008 668666 446034 72694")

In TextInputFormat, the key is the byte offset of the start of the line, measured from the beginning of the file.
For the first line, the offset (key) is 0.
For the second line, the offset is the length of the first line, including its line terminator. For the third line, the offset is the offset of the second line plus the length of the second line, and so on.
So no, it will not produce the map input you expect.
Assuming the words are separated by single spaces and each line ends with a single linefeed byte, it would be something like:
(0,dobbs 2007 20 18 15)
(20,dobbs 2008 22 20 12)
(40,doctor 2007 545525 366136 57313)
(72,doctor 2008 668666 446034 72694)
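The offsets above can be checked with a short sketch. This is plain Python that only illustrates the numbering, not Hadoop code; the helper name `line_offsets` is made up for this example, and it assumes each line ends with a single linefeed byte.

```python
def line_offsets(data: bytes):
    """Return (byte_offset, line_text) pairs, one per line of data."""
    pairs = []
    offset = 0
    # keepends=True keeps the terminator, so len(raw) counts it too
    for raw in data.splitlines(keepends=True):
        pairs.append((offset, raw.rstrip(b"\r\n").decode("utf-8")))
        offset += len(raw)
    return pairs


# The four lines from the question, each terminated by one "\n" byte (assumption)
sample = (
    b"dobbs 2007 20 18 15\n"
    b"dobbs 2008 22 20 12\n"
    b"doctor 2007 545525 366136 57313\n"
    b"doctor 2008 668666 446034 72694\n"
)

for key, value in line_offsets(sample):
    print(key, value)
```

If the file used two-byte CR LF terminators instead, every line would count one byte more and the later offsets would shift accordingly.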


Automating a roster with bash

I have to create a cleaning roster for an apartment building and would like to automate it with GNU bash if possible.
Requirements:
The tenants have to clean the corridor on their floor every week.
The cycle starts on Feb. 11, 2019 and lasts for 30 weeks (10x3).
There are 4 floors to my building.
There are 10 tenants capable of doing the task per floor.
The names of the tenants are in the 3rd column of the file tenants.csv, (sep = |).
The 1st column contains the apartment number; if it starts with a 2, as in 214, the tenant is located on Floor 2.
I would like to generate the dates automatically (maybe with the date command and the week number %V, where weeks start on Mondays) and merge in the names of the tenants from the csv file. Using the date command with %V is far more complicated than what I am used to, and I don't know how to tackle this.
Desired Output (sample taken from the 2018 roster):
Week of Floor 1 Floor 2 Floor 3 Floor 4
Sep 18, Nov 27, Feb 5 Ms.X Mr.Y Ms.XX Mr.YY
Sep 25, Dec 4, Feb 19 Ms.AA Ms.BB Mr.CC Mrs.DD
...
So far I have only this. The display, which I think I can handle, depends on how I get the date command to give me the proper dates:
roster_start=$(date -d "20190211") # 11 fev 2019 start of cleaning roster
yr=2019; wk=6
date -d "Feb 6 $yr" +%V
date -d "20190211"
printf "\nWeek of\tFloor 1\t\tFloor 2\t\tFloor 3\t\tFloor 4\n"; \
for wk in 6 16 26 "$yr"; do
printf "%s\t" "$d"
date -d "$wk" +"%b %e"
done
Thank you for any help you can provide.
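The date arithmetic can be sketched as follows. This is Python rather than bash, and it only generates the "Week of" column. It assumes, based on the 2018 sample, that each row lists the same slot in three consecutive 10-week cycles. The names `roster_dates` and `fmt` are invented for this example, and merging in the tenant names is left out.

```python
from datetime import date, timedelta


def roster_dates(start: date, rows: int = 10, cycles: int = 3):
    """Return `rows` rows; row k holds the week-start dates k, k+rows, k+2*rows weeks after start."""
    table = []
    for row in range(rows):
        table.append(
            [start + timedelta(weeks=row + cycle * rows) for cycle in range(cycles)]
        )
    return table


def fmt(d: date) -> str:
    """Format a date like 'Feb 11' (month abbreviation depends on the locale)."""
    return f"{d:%b} {d.day}"


# Cycle start taken from the question: Monday, Feb. 11, 2019
for week_dates in roster_dates(date(2019, 2, 11)):
    print(", ".join(fmt(d) for d in week_dates))
```

Adding whole weeks to a known Monday avoids working with %V week numbers at all, which may be simpler than converting week numbers back into dates.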

VIM visual block sort numerically using Vissort

I need help using the Vissort plugin to sort a visual block numerically.
The documentation states that I can use the 'VSO n' command to set sort to 'n' for numeric but I'm having no success.
I'm on a windows machine.
The example below has only one column, but in the real world I need to be able to sort numerically on any column within a text file. For now my workaround is using '!gsort.exe -k 10 -n' to sort by the 10th column.
After using 'VSO n' and running Vissort, this is how my list is sorted:
1
11
13
15
17
19
2
21
23
25
27
29
3
31
33
35
37
39
You can use GNU sort. First select visual block and then:
:'<,'>!sort -k 1 -n
Apparently, the :VSO option only applies to the :Vissort command, not to :'<,'>B sort.
So, either of these should work:
:VSO n
:'<,'>Vissort
or
:'<,'>B sort n

Sequential log file timestamp check

I've got a large log file and I need to confirm that its timestamped entries are in chronological order. I know how to read line 1 and extract the timestamp. I then need to compare it to every line from line 2 to the last line and print any whole line whose timestamp is earlier than that of line 1. Then I read line 2 and repeat the comparison for lines 3 to last, then line 3 against lines 4 to last, and so on. The outer loop reading the log file is no problem, but how do I read the same file again starting at line n+1, where n is the line the outer loop is on? For example, if the outer loop has read line 10, how do I get the inner loop to start reading at line 11? Each log file has tens of thousands of lines and I have several dozen log files to process, so speed is important.
Log file line format is:
Sep 17 16:09:51 2014 blah blah blah…
Sep 17 16:09:52 2014 blah blah blah…
Sep 17 16:09:52 2014 blah blah blah…
Sep 17 15:11:10 2014 blah blah blah…
Sep 17 16:11:10 2014 blah blah blah…
I'm trying to detect entries like line 4.
I can switch to Perl if it's faster.
Should I read the log file into an array to make the internal loop relative read position easy to do, or will the file size make the array size prohibitive?
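One possible shortcut: the nested comparison flags a line exactly when some earlier line carries a later timestamp, which is the same as being earlier than the latest timestamp seen so far. A single pass that remembers that maximum may therefore be enough, with no array and no re-reading. Below is a minimal sketch in Python (not bash or Perl). The function name `out_of_order` is made up, and it assumes every line starts with a timestamp in the format shown above.

```python
from datetime import datetime


def out_of_order(lines):
    """Return (line_number, line) for each entry older than some earlier entry."""
    flagged = []
    latest = None  # latest timestamp seen so far
    for number, line in enumerate(lines, start=1):
        # First four whitespace-separated fields: month, day, time, year
        fields = line.split(None, 4)
        stamp = datetime.strptime(" ".join(fields[:4]), "%b %d %H:%M:%S %Y")
        if latest is not None and stamp < latest:
            flagged.append((number, line))
        else:
            latest = stamp
    return flagged


# The sample lines from the question
sample = [
    "Sep 17 16:09:51 2014 blah blah blah…",
    "Sep 17 16:09:52 2014 blah blah blah…",
    "Sep 17 16:09:52 2014 blah blah blah…",
    "Sep 17 15:11:10 2014 blah blah blah…",
    "Sep 17 16:11:10 2014 blah blah blah…",
]

for number, line in out_of_order(sample):
    print(number, line)
```

For a real file, the same function can be fed an open file object line by line, so memory use stays constant regardless of file size.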

Shell script Email bad formatting?

My script works fine and produces a file. The file is plain text and is formatted like the expected results shown further down. However, when I send the file to my email, the formatting is completely wrong.
This is the line of code I am using to send the email:
cat ReportEmail | mail -s 'Report' bob@aol.com
The result I am getting on my email.
30129 22.65 253
96187 72.32 294
109525 82.35 295
10235 7.7 105
5906 4.44 106
76096 57.22 251
My expected results should look like this.
30129 22.65 253
96187 72.32 294
109525 82.35 295
10235 7.7 105
5906 4.44 106
76096 57.22 251
Your source file achieves the column alignment by using a combination of tabs and spaces. The width assigned to a tab, however, can vary from program to program. Widths of 4, 5, or 8 spaces, for example, are common. If you want consistent formatting in plain text from one viewer to the next, use only spaces.
As a workaround, you can expand the tabs to spaces before passing the file to mail using the expand utility:
expand -t 8 ReportEmail.txt | mail -s 'Report' bob@aol.com
The option -t 8 tells expand to treat tabs as 8 spaces wide. Change the 8 to whatever number consistently makes the format in ReportEmail.txt work properly.
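The effect of the tab width can be illustrated with Python's `str.expandtabs`, which does roughly what expand does. The exact whitespace in the original file is not known, so the tab positions in the rows below are an assumption made for this sketch, and `render` is a made-up helper name.

```python
def render(rows, tab_width):
    """Expand tabs in each row as a viewer with the given tab width would."""
    return [row.expandtabs(tab_width) for row in rows]


# Two rows from the report, assuming the columns are separated by single tabs
rows = ["30129\t22.65\t253", "10235\t7.7\t105"]

for width in (4, 8):
    print(f"tab width {width}:")
    for line in render(rows, width):
        print(line)
```

With a width of 8 the third column starts at the same position in both rows, while with a width of 4 it does not, which is the kind of misalignment a mail client with a different tab width can produce.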

Reading a text file in Ruby gives wrong output

I am not an experienced Ruby programmer, so bear with me. I have a problem with this specific text file containing two lines (the issue shows up only occasionally):
trim(0, 15447)
0, 15447
I am trying to read these two lines with the following code:
File.open(trim).each do |line|
puts line
end
I normally get the expected output, but here I get only one line, with some characters missing:
0, 1544715447)
If I want to check the character codes, I get this:
irb(main):120:0> File.open(trim).each do |line|
irb(main):121:1* puts '========================'
irb(main):122:1> puts line
irb(main):123:1> puts '........................'
irb(main):124:1> puts line.each_byte {|c| print c, ' ' }
irb(main):125:1> end
========================
0, 1544715447)
........................
116 114 105 109 40 48 44 32 49 53 52 52 55 41 13 48 44 32 49 53 52 52 55 trim(0,0, 15447
=> #<File:E:\Public\Public_videos\Soccer\1995_0129_odp_es\950129-ODP_&m3_trim30.txt>
I frankly don't understand what is going on, as I don't see any hidden characters, and this happens seemingly at random, but consistently with some files.
Any suggestion to help me understand or avoid this issue would be greatly appreciated.
What happened is that your file had two "lines" separated by a carriage return character, and not a linefeed.
You showed the bytes in your file as
116 114 105 109 40 48 44 32 49 53 52 52 55 41 13 48 44 32 49 53 52 52 55
That 13 is a carriage return, which is sometimes "displayed" by the writer going back to the start of the line it is writing.
So first it wrote out
trim(0, 15447)
then it went back to the start of the same line and wrote
0, 15447
overlaying the initial line! What do you end up with?
0, 1544715447)
Your "problem" is probably best fixed by converting that text file of yours to use a more conventional line terminator. On Unix systems, including OS X these days, the line terminator is character 10, known as LINE FEED. Windows uses the two-character combination 13 10 (CR LF). To my knowledge, only old Mac systems used 13 (CR) on its own.
Many text editors today will allow you to select a "line ending" option, so you might be able to just open that file, then save it using a different line ending option. FWIW my guess is that you are using Windows now, which is known for rendering CRs and LFs differently than *Nix systems.
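If opening each file in an editor is impractical, the conversion can also be scripted. Here is a minimal sketch in Python (not Ruby) that rewrites CR LF and bare CR terminators as LF. The function name `normalize_newlines` is invented for this example, and the bytes are the ones printed in the question.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR LF and bare CR line terminators to a single LF."""
    # Handle CR LF first so it does not turn into two linefeeds
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Byte values from the question: "trim(0, 15447)", CR (13), "0, 15447"
raw = bytes([116, 114, 105, 109, 40, 48, 44, 32, 49, 53, 52, 52, 55, 41,
             13, 48, 44, 32, 49, 53, 52, 52, 55])

for line in normalize_newlines(raw).decode("ascii").split("\n"):
    print(line)
```

After the conversion the two lines print separately instead of the second one overwriting the first.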
