Subtract Dates in Reducer - hadoop

The input to the reducer is as follows:
key: 12
List<values> :
1,2,3,2013-12-23 10:21:44
1,2,3,2013-12-23 10:21:59
1,2,3,2013-12-23 10:22:07
The output needed is as follows:
1,2,3,2013-12-23 10:21:44,15
1,2,3,2013-12-23 10:21:59,8
1,2,3,2013-12-23 10:22:07,0
Note that the last column is the next row's time minus the current row's, i.e. Date(next) - Date(current): 10:21:59 minus 10:21:44 gives 15 seconds.
I tried loading the values into memory and subtracting, but that causes a Java heap memory error: the data for this key is huge (> 1 GB) and does not fit into main memory. Your help is highly appreciated.

Perhaps something along the lines of this pseudocode in your reduce() method:
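long lastDate = 0;
String lastValue = null;
for (Text value : values) {
    // parseDateIntoMillis is a helper you write: it extracts the 4th field and parses it
    long currentDate = parseDateIntoMillis(value);
    if (lastValue != null) {
        // the expected output is in seconds, so convert from milliseconds
        context.write(key, new Text(lastValue + "," + (currentDate - lastDate) / 1000));
    }
    lastDate = currentDate;
    // Hadoop reuses the same value object on every iteration, so keep a copy
    lastValue = value.toString();
}
if (lastValue != null) {
    context.write(key, new Text(lastValue + "," + 0));
}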
Obviously there will be tidying up to do but the general idea is fairly simple.
Note that because of your requirement to include the date of the next value as part of the current value calculation, the iteration through the values skips the first write, hence the additional write after the loop to ensure all values are accounted for.
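The same one-pass idea can be tried outside Hadoop. The following is a minimal sketch, not a reducer: the class name NextRowDiff and the method appendDiffs are made up for illustration, and it assumes the rows arrive already ordered by timestamp (in a real job that would need something like a secondary sort). It keeps only the previous row in memory, so the size of the group does not matter.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical helper class, for illustration only.
class NextRowDiff {
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Streams the rows once, emitting each row plus the seconds until the next row. */
    static void appendDiffs(Iterator<String> rows, Consumer<String> out) {
        String previousRow = null;
        LocalDateTime previousTime = null;
        while (rows.hasNext()) {
            String row = rows.next();
            // The timestamp is the fourth comma-separated field.
            LocalDateTime time = LocalDateTime.parse(row.split(",")[3], FORMAT);
            if (previousRow != null) {
                long seconds = Duration.between(previousTime, time).getSeconds();
                out.accept(previousRow + "," + seconds);
            }
            previousRow = row;
            previousTime = time;
        }
        if (previousRow != null) {
            out.accept(previousRow + ",0"); // the last row has no successor
        }
    }
}
```

Calling appendDiffs with an iterator over the three sample rows and System.out::println as the consumer prints one output line per input line.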
If you have any questions feel free to ask away.

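You can do it with the following code:
// needs java.text.SimpleDateFormat, java.text.ParseException, java.util.Date, java.io.IOException
public void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    Date currentDate = null;
    LongWritable diff = new LongWritable();
    for (Text value : values) {
        Date nextDate;
        try {
            // the timestamp is the 4th comma-separated field
            nextDate = format.parse(value.toString().split(",")[3]);
        } catch (ParseException e) {
            throw new IOException("Unparseable date in: " + value, e);
        }
        if (currentDate != null) {
            // difference between consecutive rows, in seconds
            diff.set(Math.abs(nextDate.getTime() - currentDate.getTime()) / 1000);
            context.write(key, diff);
        }
        currentDate = nextDate;
    }
}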


Trying to figure out why a foreach statement worked inside an if statement but throws an exception when I switch it to a for loop

Sorry in advance, as I have only been working with C# for about a month, with limited history in VB years ago. It's a mail-merge kind of loop that I am trying to create for work to make their life easier. I have the dates figured out. I have a NumericUpDown control setting the int myInt, and a formCount int starting at 0.
The code worked fine when I used if(formCount==0). When I switched it to
for(formCount=0;formCount<myInt;formCount++)
it now throws a
"System.NullReferenceException: 'Object reference not set to an instance of an object.'"
I know there is probably another way to do what I am working on, which is just to add sequential dates to forms a month at a time. I have the dates stored in an array, myDate[31].
I am using the numUpDwn control (min 1, max 31) to get myInt so we can select how many days are in the month, or print only a couple of days if we need to replace pages, so we can print anywhere from 1 to 31 pages.
With the if statement it would create the first page from the template (.dotx) as doc, copy the contents of doc to doc2, and add a new page to receive the next pasted content.
I am sure this is a silly question that someone will have a simple answer to. The loop is supposed to open the template, add the date, copy to doc2, close the original, and restart until it reaches the number of pages/dates selected. Thanks for any help; this is the last section I need to finish and I am stumped. Oh, and I used != because it was skipping the merge field, but with only one field, testing for not equal to anything worked.
private void BtnPrint_Click(object sender, EventArgs e)
{
var app = new Microsoft.Office.Interop.Word.Application();
var doc = new Microsoft.Office.Interop.Word.Document();
var doc2 = new Microsoft.Office.Interop.Word.Document();
//app.Visible = true;
doc = null;
doc2.PageSetup.Orientation = WdOrientation.wdOrientLandscape;
doc2.PageSetup.TopMargin = app.InchesToPoints(0.6f);
doc2.PageSetup.BottomMargin = app.InchesToPoints(0.17f);
doc2.PageSetup.LeftMargin = app.InchesToPoints(0.5f);
doc2.PageSetup.RightMargin = app.InchesToPoints(0.5f);
String fileSave;
fileSave = ("OTSU" + "_" + myDate[0].Month + "_" + myDate[0].Year + ".docx");
int formCount;
//formCount = 0;
var filepath = System.Windows.Forms.Application.StartupPath + outfile;
doc = app.Documents.Add(filepath);
doc2.Activate();
//OBJECT OF MISSING "NULL VALUE"
Object oMissing = System.Reflection.Missing.Value;
for (formCount = 0; formCount<myInt;formCount++)
{
doc.Activate();
foreach (Microsoft.Office.Interop.Word.Field field in doc.Fields)
{
Range rngFieldCode = field.Code;
String fieldText = rngFieldCode.Text;
// ONLY GETTING THE MAILMERGE FIELDS
if (fieldText.StartsWith(" MERGEFIELD"))
{
Int32 endMerge = fieldText.IndexOf("\\");
Int32 fieldNameLength = fieldText.Length - endMerge;
String fieldName = fieldText.Substring(11);
// GIVES THE FIELDNAMES AS THE USER HAD ENTERED IN .dotx FILE
fieldName = fieldName.Trim();
if (fieldName != "M_2nd__3rd")
{
field.Select();
app.Selection.TypeText(myDate[formCount].ToShortDateString());
}
formCount++;
Microsoft.Office.Interop.Word.Range dRange = doc.Content;
dRange.Copy();
doc2.Range(doc2.Content.End - 1, doc2.Content.End - 1).PasteSpecial(DataType: Microsoft.Office.Interop.Word.WdPasteOptions.wdKeepSourceFormatting);
doc2.Range(doc2.Content.End - 1, doc2.Content.End - 1).InsertBreak(Microsoft.Office.Interop.Word.WdBreakType.wdPageBreak);
Clipboard.Clear();
doc.Close(WdSaveOptions.wdDoNotSaveChanges);
}
}
}
doc2.SaveAs2("OTSU" + myDate[0].Month + "_" + myDate[0].Year + ".docx");
app.Documents.Open("OTSU" + myDate[0].Month + "_" + myDate[0].Year + ".docx");
doc.Close(WdSaveOptions.wdDoNotSaveChanges);
doc2.Close(WdSaveOptions.wdDoNotSaveChanges);
I haven't run your code, but as far as I can see it would probably fail even without the formCount loop when there is more than one MERGEFIELD field, because you close doc as soon as you have processed such a field, and yet the foreach loop is still processing each Field in doc.Fields.
Even if that foreach loop terminates gracefully, the next iteration of the formCount loop calls doc.Activate(), but doc has been closed, so that will fail.
So I suggest that the main thing to do is to consider which documents need to be open at which point for the process to work.
Some observations (not necessarily to do with your primary question):
Where is myInt set?
Is having a formCount++ loop and also using formCount++ within the loop for every MERGEFIELD in doc really your intention?
You might be better off testing field.Type (for example against WdFieldType.wdFieldMergeField) when filtering MERGEFIELD fields rather than matching the text, at least if such fields can be set up by end users.
When you process collections in Word and you are either adding or deleting members of the collection, you sometimes have to use a loop that starts with the last member of the collection and works back towards the beginning. I'm not sure you need to do that in this case, but since you may be "deleting" when you do your field.Select and then TypeText, please bear that in mind.
It may seem like a complication when you are mainly trying to sketch out the logic of your loops, but I generally find it very helpful to start using try...catch...finally blocks sooner rather than later during development.
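The point above about looping from the last member back to the first is not specific to Word. The sketch below shows it in plain Java, with an ordinary list standing in for the Fields collection; it is not Word interop code, and the class and method names are made up for illustration. Removing an element only shifts the elements after it, so a backwards loop never skips an entry it has not yet visited.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a collection that shrinks while it is being processed.
class ReverseRemoval {
    /** Removes every entry starting with the given prefix, walking from the end. */
    static List<String> removeByPrefix(List<String> items, String prefix) {
        List<String> copy = new ArrayList<>(items);
        for (int i = copy.size() - 1; i >= 0; i--) {
            if (copy.get(i).startsWith(prefix)) {
                copy.remove(i); // only shifts elements that were already visited
            }
        }
        return copy;
    }
}
```

A forward loop with an index would step over the element that slides into the freed slot after each removal, which is the kind of surprise the observation warns about.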
I did find a solution for now. Since there is only one MERGEFIELD, instead of trying to open doc, insert the date, copy to the new doc2, close doc, and repeat, I found I can open doc, insert the date, copy to doc2, undo the edit on doc, and repeat. At least it works for now, and I can get back to the books and learn some more while I map out the big project I have planned. I am sure I will be on here a bit with more questions. Without @slightly-snarky asking the questions he did, I wouldn't have thought of this, so I have to give them credit for the answer. I did have to put doc.Undo(); at the top of the loop, and it will only work with one field. But it's a start.
private void BtnPrint_Click(object sender, EventArgs e)
{
var app = new Microsoft.Office.Interop.Word.Application();
String fileSave;
fileSave = ("OTSU" + "_" + myDate[0].Month + "_" + myDate[0].Year + ".docx");
int formCount;
formCount = 0;
var filepath = System.Windows.Forms.Application.StartupPath + outfile;
var doc = new Microsoft.Office.Interop.Word.Document();
doc = app.Documents.Add(filepath);
app.Visible = true;
doc.Activate();
var doc2 = new Microsoft.Office.Interop.Word.Document();
doc2.PageSetup.Orientation = WdOrientation.wdOrientLandscape;
doc2.PageSetup.TopMargin = app.InchesToPoints(0.6f);
doc2.PageSetup.BottomMargin = app.InchesToPoints(0.17f);
doc2.PageSetup.LeftMargin = app.InchesToPoints(0.5f);
doc2.PageSetup.RightMargin = app.InchesToPoints(0.5f);
doc2.Activate();
//OBJECT OF MISSING "NULL VALUE"
Object oMissing = System.Reflection.Missing.Value;
for (formCount = 0; formCount < myInt; formCount++)
{
doc.Undo();
foreach (Microsoft.Office.Interop.Word.Field field in doc.Fields)
{
Range rngFieldCode = field.Code;
String fieldText = rngFieldCode.Text;
// ONLY GETTING THE MAILMERGE FIELDS
if (fieldText.StartsWith(" MERGEFIELD"))
{
Int32 endMerge = fieldText.IndexOf("\\");
Int32 fieldNameLength = fieldText.Length - endMerge;
String fieldName = fieldText.Substring(11);
// GIVES THE FIELDNAMES AS THE USER HAD ENTERED IN .dotx FILE
fieldName = fieldName.Trim();
if (fieldName != "M_2nd__3rd")
{
field.Select();
app.Selection.TypeText(myDate[formCount].ToShortDateString());
}
Microsoft.Office.Interop.Word.Range dRange = doc.Content;
dRange.Copy();
doc2.Range(doc2.Content.End - 1, doc2.Content.End - 1).PasteSpecial(DataType: Microsoft.Office.Interop.Word.WdPasteOptions.wdKeepSourceFormatting);
doc2.Range(doc2.Content.End - 1, doc2.Content.End - 1).InsertBreak(Microsoft.Office.Interop.Word.WdBreakType.wdPageBreak);
Clipboard.Clear();
}
}
}
doc2.SaveAs2("OTSU" + myDate[0].Month + "_" + myDate[0].Year + ".docx");
doc.Close(WdSaveOptions.wdDoNotSaveChanges);
doc2.Close(WdSaveOptions.wdDoNotSaveChanges);
app.Documents.Open("OTSU" + myDate[0].Month + "_" + myDate[0].Year + ".docx");
}
}
}

How to get new userinput in a stream while its running using Java8

I need to validate user input and, if it doesn't meet the conditions, replace it with correct input. So far I am stuck on two parts. I'm fairly new to Java 8 and not so familiar with all the libraries, so if you can give me advice on where to read up more on these I would appreciate it.
List<String> input = Arrays.asList(args);
List<String> validatedinput = input.stream()
.filter(p -> {
if (p.matches("[0-9, /,]+")) {
return true;
}
System.out.println("The value has to be positve number and not a character");
//Does the new input actually get saved here?
sc.nextLine();
return false;
}) //And here I am not really sure how to map the String object
.map(String::)
.validatedinput(Collectors.toList());
This type of logic shouldn't be done with streams; a while loop would be a good candidate for it.
First, let's partition the data into two lists, one list representing the valid inputs and the other representing invalid inputs:
Map<Boolean, List<String>> resultSet =
Arrays.stream(args)
.collect(Collectors.partitioningBy(s -> s.matches(yourRegex),
Collectors.toCollection(ArrayList::new)));
Then create the while loop to ask the user to correct all their invalid inputs:
int i = 0;
List<String> invalidInputs = resultSet.get(false);
final int size = invalidInputs.size();
while (i < size){
System.out.println("The value --> " + invalidInputs.get(i) +
" has to be positive number and not a character");
String temp = sc.nextLine();
if(temp.matches(yourRegex)){
resultSet.get(true).add(temp);
i++;
}
}
Now, you can collect the list of all the valid inputs and do what you like with it:
List<String> result = resultSet.get(true);
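Putting the partitioning step and the correction loop together, a self-contained sketch could look like the following. The class name InputValidator, the method validate and the digits-only pattern are assumptions for illustration; the original leaves yourRegex open. The Scanner is passed in so the same code works with System.in or with canned input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.stream.Collectors;

// Hypothetical wrapper around the partition-then-loop approach.
class InputValidator {
    // Assumed pattern: one or more digits. Replace with your own regex.
    static final String DIGITS = "[0-9]+";

    static List<String> validate(String[] args, Scanner sc) {
        // Split the arguments into valid (true) and invalid (false) lists.
        Map<Boolean, List<String>> parts = Arrays.stream(args)
            .collect(Collectors.partitioningBy(s -> s.matches(DIGITS),
                Collectors.toCollection(ArrayList::new)));
        List<String> valid = parts.get(true);
        for (String bad : parts.get(false)) {
            String replacement = null;
            // Keep asking until a valid value arrives or the input runs out.
            while (replacement == null && sc.hasNextLine()) {
                System.out.println("The value --> " + bad
                    + " has to be a positive number and not a character");
                String line = sc.nextLine().trim();
                if (line.matches(DIGITS)) {
                    replacement = line;
                }
            }
            if (replacement != null) {
                valid.add(replacement);
            }
        }
        return valid;
    }
}
```

With new Scanner(System.in) as the second argument the user is prompted interactively; corrected values are appended after the originally valid ones.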

Creating sequence number in hive or Pig

I'm facing a data transformation issue:
I have the table below with 3 columns: client, event, timestamp.
I basically want to assign a sequence number to all events for a given client based on timestamp, which is the "Sequence" column I added below.
Client  Event  TimeStamp            Sequence
C1      Ph     2014-01-30 12:15:23  1
C1      Me     2014-01-31 15:11:34  2
C1      Me     2014-01-31 17:16:05  3
C2      Me     2014-02-01 09:22:52  1
C2      Ph     2014-02-01 17:22:52  2
I can't figure out how to create this sequence number in Hive or Pig. Would you have any clue?
Thanks in advance!
Guillaume
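Put the records for each client in a bag (by grouping on client), sort the tuples inside the bag by the timestamp field, and then use an Enumerate function (for example the Enumerate UDF from Apache DataFu). Grouping everything with GROUP ... ALL would instead produce a single sequence across all clients.
Something like below (I did not execute the code, so you might need to clean it up a bit):
-- events (renamed from input, which is a reserved word in Pig) contains 3 columns: client, event, timestamp
grouped = GROUP events BY client;
numbered = FOREACH grouped
{
sorted = ORDER events BY timestamp;
sorted2 = Enumerate(sorted);
GENERATE FLATTEN(sorted2);
}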
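To make the intended result concrete, here is the same group, sort, and number logic as a small in-memory Java sketch. It is not Pig or Hive code; the class name SequenceNumbers and the row layout {client, event, timestamp} are assumptions for illustration, and it relies on the "yyyy-MM-dd HH:mm:ss" timestamps sorting correctly as plain strings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory illustration of a per-client sequence number.
class SequenceNumbers {
    /** Each row is {client, event, timestamp}; the result adds a 4th column. */
    static List<String[]> number(List<String[]> rows) {
        List<String[]> sorted = new ArrayList<>(rows);
        // Order by client first, then by timestamp within each client.
        sorted.sort(Comparator.comparing((String[] r) -> r[0])
            .thenComparing(r -> r[2]));
        List<String[]> out = new ArrayList<>();
        String previousClient = null;
        int sequence = 0;
        for (String[] r : sorted) {
            // Restart the counter whenever a new client begins.
            sequence = r[0].equals(previousClient) ? sequence + 1 : 1;
            previousClient = r[0];
            out.add(new String[] {r[0], r[1], r[2], String.valueOf(sequence)});
        }
        return out;
    }
}
```

In SQL terms this corresponds to numbering rows within a partition by client, ordered by timestamp.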
We eventually modified the Enumerate source in the following way, and it works great:
public void accumulate(Tuple arg0) throws IOException {
nevents=13;
i=nevents+1;
DataBag inputBag = (DataBag)arg0.get(0);
Tuple t2 = TupleFactory.getInstance().newTuple();
for (Tuple t : inputBag) {
Tuple t1 = TupleFactory.getInstance().newTuple(t.getAll());
tampon=t1.get(2).toString();
if (tampon.equals("NA souscription Credit Conso")) {
if (i <= nevents) {
outputBag.add(t2);
t2=TupleFactory.getInstance().newTuple();
}
i=0;
t2.append(t1.get(0).toString());
t2.append(t1.get(1).toString());
t2.append(t1.get(2).toString());
i++;
}
else if (i < nevents) {
t2.append(tampon);
i++;
}
else if (i == nevents) {
t2.append(tampon);
outputBag.add(t2);
i++;
t2=TupleFactory.getInstance().newTuple();
}
if (count % 1000000 == 0) {
outputBag.spill();
count = 0;
}
count++;
}
if (t2.size()!=0) {
outputBag.add(t2);
}
}

Getting top 100 URL from a log file

One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?
We have a fairly large log file, about 5 GB. Each line of the log file contains a URL which a user has visited on our site. We want to figure out the 100 most popular URLs visited by our users. How can we do it?
If we have more than 10 GB of RAM, just do it straightforwardly with a hashmap.
Otherwise, separate the log into several files using a hash function of the URL, so that every occurrence of a given URL lands in the same file. Then process each file and get its top 100. With the top 100 of each file, it is easy to get the overall top 100.
Another solution is to sort the file using any external sorting method and then scan it once to count each occurrence. In the process, you don't have to keep all the counts: you can safely throw away anything that doesn't make it into the current top 100.
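The hash-partitioning idea above can be sketched in memory, with lists standing in for the temporary files. This is an illustration under assumptions, not a file-based implementation: the class name PartitionedTopN and both method names are made up, and a real version would write each bucket to disk and process the buckets one at a time. Because the bucket is chosen from the URL's hash, a URL's full count lives in exactly one bucket, which is what makes merging the per-bucket winners safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory model of "split by hash, take top N per file, merge".
class PartitionedTopN {
    /** Counts one bucket and returns its n most frequent entries. */
    static List<Map.Entry<String, Long>> topOfBucket(List<String> bucket, int n) {
        Map<String, Long> counts = bucket.stream()
            .collect(Collectors.groupingBy(u -> u, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .collect(Collectors.toList());
    }

    static List<String> topN(List<String> urls, int buckets, int n) {
        // Lists stand in for the temporary files; the same URL always maps to the same bucket.
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            parts.add(new ArrayList<>());
        }
        for (String url : urls) {
            parts.get(Math.floorMod(url.hashCode(), buckets)).add(url);
        }
        // Collect the winners of every bucket, then pick the overall winners.
        List<Map.Entry<String, Long>> candidates = new ArrayList<>();
        for (List<String> part : parts) {
            candidates.addAll(topOfBucket(part, n));
        }
        return candidates.stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

The order among URLs with equal counts is not defined in this sketch.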
Just sort the log file by URL (heap sort needs only constant extra space; quick sort needs a recursion stack of about O(log n)) and then count how many times each URL appears, which is easy because lines with the same URL are next to each other.
Overall complexity is O(n log n).
Why splitting into many files and keeping only the top 3 (or top 5 or top N) of each file is wrong when the same URL can appear in more than one file:
URL    File1  File2  File3
url1   5      0      5
url2   0      5      5
url3   5      5      0
url4   5      0      0
url5   0      5      0
url6   0      0      5
url7   4      4      4
url7 never makes it into the top 3 of any individual file but is the best overall.
Because the log file is fairly large, you should read it with a stream reader. Don't read it all into memory.
I would expect it to be feasible to hold the distinct links and their counts in memory while we work through the log file.
// Pseudo
Hashmap map<url,count>
while(log file has nextline){
url = nextline in logfile
add url to map and update count
}
List list
foreach(m in map){
add m to list
}
sort the list by count value
take top n from the list
The runtime is O(n) + O(m*log(m)) where n is the size of the log-file in lines and where the m is number of distinct found links.
Here's a C# implementation of the pseudo-code. An actual file reader and log file are not provided; a simple emulation of reading a log file from an in-memory list is provided instead.
The algorithm uses a hashmap to store the links it finds. A sorting algorithm then finds the top 100 links. A simple data-container class is used for the sorting algorithm.
The memory complexity depends on the expected number of distinct links. The hashmap must be able to hold all distinct links found, otherwise this algorithm won't work.
// Implementation
using System;
using System.Collections.Generic;
using System.Linq;
public class Program
{
public static void Main(string[] args)
{
RunLinkCount();
Console.WriteLine("press a key to exit");
Console.ReadKey();
}
class LinkData : IComparable
{
public string Url { get; set; }
public int Count { get; set; }
public int CompareTo(object obj)
{
var other = obj as LinkData;
int i = other == null ? 0 : other.Count;
return i.CompareTo(this.Count);
}
}
static void RunLinkCount()
{
// Data setup
var urls = new List<string>();
var rand = new Random();
const int loglength = 500000;
// Emulate the log-file
for (int i = 0; i < loglength; i++)
{
urls.Add(string.Format("http://{0}.com", rand.Next(1000)
.ToString("x")));
}
// Hashmap memory must be allocated
// to contain distinct number of urls
var lookup = new Dictionary<string, int>();
var stopwatch = new System.Diagnostics.Stopwatch();
stopwatch.Start();
// Algo-time
// O(n) where n is log line count
foreach (var url in urls) // Emulate stream reader, readline
{
if (lookup.ContainsKey(url))
{
int i = lookup[url];
lookup[url] = i + 1;
}
else
{
lookup.Add(url, 1);
}
}
// O(m) where m is number of distinct urls
var list = lookup.Select(i => new LinkData
{ Url = i.Key, Count = i.Value }).ToList();
// O(mlogm)
list.Sort();
// O(m)
var top = list.Take(100).ToList(); // top urls
stopwatch.Stop();
// End Algo-time
// Show result
// O(1)
foreach (var i in top)
{
Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
}
Console.WriteLine(string.Format("Time elapsed msec: {0}",
stopwatch.ElapsedMilliseconds));
}
}
Edit: This answer has been updated based on the comments
added: running time and memory complexity analysis
added: pseudo-code
added: explain how we manage a fairly large log-file
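As a variation on the counting approach in this answer, the final sort of all m distinct links can be replaced by a bounded min-heap, which brings the selection step down from O(m*log(m)) to O(m*log(N)) for the top N. The sketch below is in Java rather than C#, and the class name HeapTopN is made up for illustration; any Iterable of lines (for example one backed by a buffered file reader) can be passed in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical variant: count with a hashmap, select with a size-limited min-heap.
class HeapTopN {
    static List<String> topN(Iterable<String> lines, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : lines) {
            counts.merge(url, 1, Integer::sum);
        }
        // Min-heap ordered by count: the root is the weakest current candidate.
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // drop the least frequent candidate
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result); // most frequent first
        return result;
    }
}
```

The hashmap still has to hold every distinct link, so the memory requirement is the same as before; only the selection step gets cheaper.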

compare datetime and timespan value

I want to compare a DateTime value and a TimeSpan value in order to check for a negative result.
My code is here:
TimeSpan lateaftertime = new TimeSpan();
lateaftertime = Convert.ToDateTime(intime) - lateafter;
string latetime = lateaftertime.Hours + ":" + lateaftertime.Minutes;
if ((lateafter < lateaftertime))
{
Session["late"] = "00:00";
}
else
{
Session["late"] = latetime;
}
If lateaftertime is negative, Session["late"] should be "00:00"; otherwise the session should hold the difference.
Please help me, I am having trouble with this.
Your question is pretty unclear, but it sounds like you really want:
if (lateaftertime < TimeSpan.Zero)
{
Session["late"] = "00:00";
}
else
{
Session["late"] = latetime;
}
or more concisely:
Session["late"] = lateaftertime < TimeSpan.Zero ? "00:00" : latetime;
It's possible you want > rather than < here; it's hard to tell what you're trying to achieve. Sample data would make it clearer. You should also rename your variables to be more conventional, e.g. lateAfterTime instead of lateaftertime.
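For comparison, the same "clamp a negative difference to zero" check can be written in Java with java.time.Duration, which plays the role TimeSpan plays in C#. This is a sketch rather than a drop-in replacement for the code above: the class name LateTime and the method late are made up, and it assumes both inputs are full date-times. It also zero-pads hours and minutes, which the string concatenation in the question does not.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical Java counterpart of the TimeSpan.Zero comparison.
class LateTime {
    /** Returns how late inTime is relative to lateAfter as HH:mm, or 00:00 if not late. */
    static String late(LocalDateTime inTime, LocalDateTime lateAfter) {
        Duration diff = Duration.between(lateAfter, inTime); // inTime minus lateAfter
        if (diff.isNegative()) {
            diff = Duration.ZERO; // arrived before the cut-off: not late
        }
        return String.format("%02d:%02d", diff.toHours(), diff.toMinutes() % 60);
    }
}
```

Comparing the difference against zero, rather than against one of the original date-times, is the same fix as the TimeSpan.Zero comparison shown above.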
