Running a MapReduce job on the Cloudera demo CDH3u4 (airline data example) - hadoop

I'm working through Jeffrey Breen's R-Hadoop tutorial (October 2012).
At the moment I'm trying to populate HDFS and then run the commands Jeffrey published in his tutorial in RStudio. Unfortunately I'm running into some trouble with it:
UPDATE: I have now moved the data folder to:
/home/cloudera/data/hadoop/wordcount (and likewise for the airline data)
Now when I run populate.hdfs.sh I get the following output:
[cloudera@localhost ~]$ /home/cloudera/TutorialBreen/bin/populate.hdfs.sh
mkdir: cannot create directory /user/cloudera: File exists
mkdir: cannot create directory /user/cloudera/wordcount: File exists
mkdir: cannot create directory /user/cloudera/wordcount/data: File exists
mkdir: cannot create directory /user/cloudera/airline: File exists
mkdir: cannot create directory /user/cloudera/airline/data: File exists
put: Target /user/cloudera/airline/data/20040325.csv already exists
I then tried the commands in RStudio as shown in the tutorial, but I get errors at the end. Can someone show me what I did wrong?
> if (LOCAL)
+ {
+ rmr.options.set(backend = 'local')
+ hdfs.data.root = 'data/local/airline'
+ hdfs.data = file.path(hdfs.data.root, '20040325-jfk-lax.csv')
+ hdfs.out.root = 'out/airline'
+ hdfs.out = file.path(hdfs.out.root, 'out')
+ if (!file.exists(hdfs.out))
+ dir.create(hdfs.out.root, recursive=T)
+ } else {
+ rmr.options.set(backend = 'hadoop')
+ hdfs.data.root = 'airline'
+ hdfs.data = file.path(hdfs.data.root, 'data')
+ hdfs.out.root = hdfs.data.root
+ hdfs.out = file.path(hdfs.out.root, 'out')
+ }
> asa.csvtextinputformat = make.input.format( format = function(con, nrecs) {
+ line = readLines(con, nrecs)
+ values = unlist( strsplit(line, "\\,") )
+ if (!is.null(values)) {
+ names(values) = c('Year','Month','DayofMonth','DayOfWeek','DepTime','CRSDepTime',
+ 'ArrTime','CRSArrTime','UniqueCarrier','FlightNum','TailNum',
+ 'ActualElapsedTime','CRSElapsedTime','AirTime','ArrDelay',
+ 'DepDelay','Origin','Dest','Distance','TaxiIn','TaxiOut',
+ 'Cancelled','CancellationCode','Diverted','CarrierDelay',
+ 'WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay')
+ return( keyval(NULL, values) )
+ }
+ }, mode='text' )
> mapper.year.market.enroute_time = function(key, val) {
+ if ( !identical(as.character(val['Year']), 'Year')
+ & identical(as.numeric(val['Cancelled']), 0)
+ & identical(as.numeric(val['Diverted']), 0) ) {
+ if (val['Origin'] < val['Dest'])
+ market = paste(val['Origin'], val['Dest'], sep='-')
+ else
+ market = paste(val['Dest'], val['Origin'], sep='-')
+ output.key = c(val['Year'], market)
+ output.val = c(val['CRSElapsedTime'], val['ActualElapsedTime'], val['AirTime'])
+ return( keyval(output.key, output.val) )
+ }
+ }
> reducer.year.market.enroute_time = function(key, val.list) {
+ if ( require(plyr) )
+ val.df = ldply(val.list, as.numeric)
+ else { # this is as close as my deficient *apply skills can come w/o plyr
+ val.list = lapply(val.list, as.numeric)
+ val.df = data.frame( do.call(rbind, val.list) )
+ }
+ colnames(val.df) = c('crs', 'actual','air')
+ output.key = key
+ output.val = c( nrow(val.df), mean(val.df$crs, na.rm=T),
+ mean(val.df$actual, na.rm=T),
+ mean(val.df$air, na.rm=T) )
+ return( keyval(output.key, output.val) )
+ }
> mr.year.market.enroute_time = function (input, output) {
+ mapreduce(input = input,
+ output = output,
+ input.format = asa.csvtextinputformat,
+ output.format='csv', # note to self: 'csv' for data, 'text' for bug
+ map = mapper.year.market.enroute_time,
+ reduce = reducer.year.market.enroute_time,
+ backend.parameters = list(
+ hadoop = list(D = "mapred.reduce.tasks=2")
+ ),
+ verbose=T)
+ }
> out = mr.year.market.enroute_time(hdfs.data, hdfs.out)
Error in file(f, if (format$mode == "text") "r" else "rb") :
cannot open the connection
In addition: Warning message:
In file(f, if (format$mode == "text") "r" else "rb") :
cannot open file 'data/local/airline/20040325-jfk-lax.csv': No such file or directory
> if (LOCAL)
+ {
+ results.df = as.data.frame( from.dfs(out, structured=T) )
+ colnames(results.df) = c('year', 'market', 'flights', 'scheduled', 'actual', 'in.air')
+ print(head(results.df))
+ }
Error in to.dfs.path(input) : object 'out' not found
Thank you so much!

First of all, it looks like the command:
/usr/bin/hadoop fs -mkdir /user/cloudera/wordcount/data
is being split into multiple lines. Make sure you're entering it as-is, on a single line.
Also, the error is saying that the local directory data/hadoop/wordcount does not exist. Verify that you're running the command from the correct directory and that your local data is where you expect it to be.
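For example, a quick way to verify both sides from the shell (a sketch based on the paths you mention above; the airline paths are an assumption mirroring the wordcount layout, so adjust them if yours differ):
ls /home/cloudera/data/hadoop/wordcount
ls /home/cloudera/data/hadoop/airline
/usr/bin/hadoop fs -ls /user/cloudera/wordcount/data
/usr/bin/hadoop fs -ls /user/cloudera/airline/data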

Related

Generate Azure Storage SAS Signature In Ruby

I am trying to use the following code to generate a valid URL for accessing a blob in my Azure storage account. The Azure account name and key are stored in .env files. For some reason the URL doesn't work; I get a "Signature did not match" error.
# version 2018-11-09 and later, https://learn.microsoft.com/en-us/rest/api/storageservices/create-service-sas#version-2018-11-09-and-later
signed_permissions = "r"
signed_start = "#{(start_time - 5.minutes).iso8601}"
signed_expiry = "#{(start_time + 10.minutes).iso8601}"
canonicalized_resource = "/blob/#{Config.azure_storage_account_name}/media/#{medium.tinyurl}"
signed_identifier = ""
signed_ip = ""
signed_protocol = "https"
signed_version = "2018-11-09"
signed_resource = "b"
signed_snapshottime = ""
rscc = ""
rscd = ""
rsce = ""
rscl = ""
rsct = ""
string_to_sign = signed_permissions + "\n" +
signed_start + "\n" +
signed_expiry + "\n" +
canonicalized_resource + "\n" +
signed_identifier + "\n" +
signed_ip + "\n" +
signed_protocol + "\n" +
signed_version + "\n" +
signed_resource + "\n" +
signed_snapshottime + "\n" +
rscc + "\n" +
rscd + "\n" +
rsce + "\n" +
rscl + "\n" +
rsct
sig = OpenSSL::HMAC.digest('sha256', Base64.strict_decode64(Config.azure_storage_account_key), string_to_sign.encode(Encoding::UTF_8))
sig = Base64.strict_encode64(sig)
#result = "#{medium.storageurl}?sp=#{signed_permissions}&st=#{signed_start}&se=#{signed_expiry}&spr=#{signed_protocol}&sv=#{signed_version}&sr=#{signed_resource}&sig=#{sig}"
PS: This is in Rails and medium is a record pulled from the DB that contains information about the blob in Azure.
Turns out the issue was clock skew. The signed_start and signed_expiry offsets I was using were too tight. When I relaxed them to -30/+20, I could reliably create SAS tokens using the snippet I posted.
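In code terms that just means widening the window (a minimal sketch, assuming the same start_time variable from the snippet above and that -30/+20 refers to minutes):
signed_start  = (start_time - 30.minutes).iso8601
signed_expiry = (start_time + 20.minutes).iso8601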

Failed to get @Query Result

Hello, I'm trying to read tables related with @ManyToOne. I get the expected result when I execute the query in Navicat, but when I try to display the data in the Angular front end, I only get the main tables. This is the query:
@Query(value = "SELECT\n" +
"\tnotification.idnotif,\n" +
"\tnotification.message,\n" +
"\tnotification.\"state\",\n" +
"\tnotification.title,\n" +
"\tnotification.\"customData\",\n" +
"\tnotification.\"date\",\n" +
"\tnotification.receiver,\n" +
"\tnotification.sender,\n" +
"\tnotification.\"type\",\n" +
"\thospital.\"name\",\n" +
"\thospital.\"siretNumber\",\n" +
"\tusers.firstname,\n" +
"\tusers.\"isActive\" \n" +
"FROM\n" +
"\tnotification\n" +
"\tINNER JOIN hospital ON notification.receiver = :reciver\n" +
"\tINNER JOIN users ON notification.sender = :sender", nativeQuery = true)
List<Notification> findNotificationCustomQuery(@Param("reciver") Long reciver, @Param("sender") Long sender);
Please, what can I do to resolve this problem?
You are doing an inner join in the native query, so each result row is a plain array of columns rather than a Notification entity. Change the return type from List<Notification> to List<Object[]>, as below.
@Query(value = "SELECT\n" +
"\tnotification.idnotif,\n" +
"\tnotification.message,\n" +
"\tnotification.\"state\",\n" +
"\tnotification.title,\n" +
"\tnotification.\"customData\",\n" +
"\tnotification.\"date\",\n" +
"\tnotification.receiver,\n" +
"\tnotification.sender,\n" +
"\tnotification.\"type\",\n" +
"\thospital.\"name\",\n" +
"\thospital.\"siretNumber\",\n" +
"\tusers.firstname,\n" +
"\tusers.\"isActive\" \n" +
"FROM\n" +
"\tnotification\n" +
"\tINNER JOIN hospital ON notification.receiver = :reciver\n" +
"\tINNER JOIN users ON notification.sender = :sender", nativeQuery = true)
List<Object[]> findNotificationCustomQuery(@Param("reciver") Long reciver, @Param("sender") Long sender);
Then you have to loop over the result as below and read the attributes.
for (Object[] obj : result) {
    Object idnotif = obj[0];   // first selected column (notification.idnotif)
    Object message = obj[1];   // second selected column (notification.message)
    // cast each element to the appropriate type, e.g. ((Number) obj[0]).longValue()
}

Bash script to get mongoStats

Hoping someone has a bash script handy that will hit a MongoDB instance and get the collection stats, something like the below, that I can use in a shell script?
var collectionNames = db.getCollectionNames(), stats = [];
collectionNames.forEach(function (n) { stats.push(db[n].stats()); });
stats = stats.sort(function(a, b) { return b['size'] - a['size']; });
for (var c in stats) { print(stats[c]['ns'] + ": " + stats[c]['size'] + " (" + stats[c]['storageSize'] + ")"); }
UPDATE
One other question --- I'm looking to prefix each line with a datestamp:
"db.getCollectionNames().forEach(function (n) { var s = db[n].stats(); print('date +'%D %r %Z'''namespace=' + s['ns'] +',count=' + s['count']+',avgObjSize=' + s['avgObjSize']+',storageSize=' + s['storageSize']) })"
but my date code doesn't seem to be working :(
mongo $DB_NAME --quiet --eval "db.getCollectionNames().forEach(function (n) { var s = db[n].stats(); print(s['ns'] + ',' + s['size'] + ',' + s['storageSize']) })" | sort --numeric-sort --reverse
It will print in CSV format, which you can then manipulate with whatever tools you like.
Update:
Just add avgObjSize, totalIndexSize, and any other keys you need; edit your main question with an output example so we can sort by whatever column you desire.
Update 2:
db.getCollectionNames().forEach(function (n) { var s = db[n].stats(); printjson({'namespace': s['ns'], 'size': s['size'], 'storage': s['storageSize']}) })
db.getCollectionNames().forEach(function (n) { var s = db[n].stats(); print('size=' + s['size'] +',avgObjSize=' + s['avgObjSize']) })
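For the datestamp prefix from your update, one option (a sketch, not part of the original answer) is to leave the mongo eval alone and add the date in the shell, since date is a shell command and can't run inside the JavaScript string:
mongo $DB_NAME --quiet --eval "db.getCollectionNames().forEach(function (n) { var s = db[n].stats(); print('namespace=' + s['ns'] + ',count=' + s['count'] + ',avgObjSize=' + s['avgObjSize'] + ',storageSize=' + s['storageSize']) })" \
| while read -r line; do echo "$(date +'%D %r %Z') $line"; done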

Using ACE with WT and highlighting lines. This is not working

This code is not working: I keep getting a loading sign at the top right, although the editor itself works. Is it possible to get some help with the actual connection from the Ace editor file to the C++ Wt file?
//Start.
editor1 = new Wt::WText(wt_root);
editor1->setText("Testing for the highlight.");
editor1->setInline(false);
//REQUIREMENT FOR THE ACE EDITOR FILE INPUT.
Wt::WApplication::instance()->require(std::string("AceFiles/ace.js"));
//CONFIG FOR THE EDITOR THAT WILL SUPPORT TEXT.
editor = new Wt::WContainerWidget(wt_root);
editor->resize(500, 500);
range = new Wt::WContainerWidget(wt_root);
//editor_ref IS THE STRING THAT THE USER IS WRITING.
std::string editor_ref = editor->jsRef();
std::string range_ref = range->jsRef();
std::string command =
editor_ref + ".editor = ace.edit(" + editor_ref + ");" +
range_ref + ".range = ace.require('ace/range')." + range_ref + ";" +
editor_ref + ".editor.setTheme(\"ace / theme / github\");" +
editor_ref + ".editor.getSession().setMode(\"ace/mode/assembly_x86\");" +
editor_ref + ".editor.session.addMarker(new Range(1, 0, 15, 0), \"fullLine\");";
editor->doJavaScript(command);
//CONFIG. FOR THE JSIGNAL USED.
//BEING THE CONNECTION BETWEEN THE C++ DOC AND THE JAVA SCRIPT.
jsignal = new Wt::JSignal<std::string>(editor, "textChanged");
jsignal->connect(this, &Ui_AceEditor::textChanged);
//CONFIG FOR THE BUTTON.
b = new Wt::WPushButton("Save", wt_root);
command = "function(object, event) {" +
jsignal->createCall(editor_ref + ".editor.getValue()") +
";}";
b->clicked().connect(command);

Ajax/jQuery Timing Issue

I have a click function that does a jQuery/Ajax $.post to get data from a web service when a span is clicked. When there is a Firebug breakpoint set on the click function, everything works as expected (some new table tr's are appended to a table). When there is no breakpoint set, nothing happens when you click the span, and Firebug doesn't show any errors. I assume from other Stack Overflow questions that this is a timing problem, but I don't know what to do about it. I have tried changing from $.post to $.ajax and setting async to false, but that didn't fix it. Here's the code for the click handler:
$('.rating_config').click(function(event){
    event.preventDefault();
    event.stopPropagation();
    var that = $(this);
    // calculate the name of the module based on the classes of the parent <tr>
    var mytrclasses = $(this).parents('tr').attr('class');
    var modulestart = mytrclasses.indexOf('module-');
    var start = mytrclasses.indexOf('-', modulestart) + 1;
    var stop = mytrclasses.indexOf(' ', start);
    var mymodule = mytrclasses.substring(start, stop);
    mymodule = mymodule.replace(/ /g, '+');
    mymodule = mymodule.replace(/_/g, '+');
    mymodule = encodeURI(mymodule);
    // calculate the name of the property based on the classes of the parent <tr>
    var propertystart = mytrclasses.indexOf('property-');
    var propstart = mytrclasses.indexOf('-', propertystart) + 1;
    var propstop = mytrclasses.indexOf(' ', propstart);
    var myproperty = mytrclasses.substring(propstart, propstop);
    myproperty = myproperty.replace(/ /g, '+');
    myproperty = myproperty.replace(/_/g, '+');
    myproperty = encodeURI(myproperty);
    var parentspanid = $(this).attr('id');
    // Remove the comparison rows if they are already present, otherwise generate them
    if ($('.comparison_' + parentspanid).length != 0) {
        $('.comparison_' + parentspanid).remove();
    } else {
        $.post('http://localhost/LearnPHP/webservice.php?user=user-0&q=comparison&level=property&module=' + mymodule + '&version_id=1.0&property=' + myproperty + '&format=xml', function(data) {
            var data = $.xml2json(data);
            for (var propnum in data.configuration.modules.module.properties.property) {
                var prop = data.configuration.modules.module.properties.property[propnum];
                console.log(JSON.stringify(prop));
                prop.mod_or_config = 'config';
                var item_id = mymodule + '?' + prop.property_name + '?' + prop.version_id + '?' + prop.value;
                item_id = convertId(item_id);
                prop.id = item_id;
                //alert('prop.conformity = ' + prop.conformity);
                // genRow(row, module, comparison, comparison_parentspanid)
                var rowstring = genRow(prop, mymodule, true, parentspanid);
                console.log('back from genRow. rowstring = ' + rowstring);
                $(that).closest('tr').after(rowstring);
                //$('tr#node-' + data[row].id + ' span#rating' + row.id).css('background', '-moz-linear-gradient(left, #ff0000 0%, #ff0000 ' + data[row].conformity + '%, #00ff00 ' + 100 - data[row].conformity + '%, #00ff00 100%');
                var conformity_color = getConformityColor(prop.conformity);
                $('tr#comparison_module_' + mymodule + '_setting_' + prop.id + ' span#module_' + mymodule + '_rating' + prop.id).css({'background':'-moz-linear-gradient(left, ' + conformity_color + ' 0%, ' + conformity_color + ' ' + prop.conformity + '%, #fffff0 ' + prop.conformity + '%, #fffff0 100%)'});
                //$('tr#comparison-' + data[row].id + ' span#rating' + data[row].id).css('background','-webkit-linear-gradient(left, #00ff00 0%, #00ff00 ' + data[row].conformity + '%, #ff0000 ' + (100 - (data[row].conformity + 2)) + '%, #ff0000 100%)');
            }
        });
        // Hide the Fix by mod column
        hideFixedByModCol();
        $('tr.comparison_' + parentspanid).each(function(i){
            if (i % 2 == 0) {
                $(that).addClass('comparison_even');
            } else {
                $(that).addClass('comparison_odd');
            }
        });
    }
});
Any help would be greatly appreciated!
I suspect your data is coming back improperly formed. Enclose the code from the breakpoint onward in a try {} catch {} block to see the error being generated. It would also be a good idea to add error handling to your Ajax request.
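A minimal sketch of both suggestions combined (this assumes the same URL and $.xml2json call from your handler; the logging is illustrative):
$.ajax({
    url: 'http://localhost/LearnPHP/webservice.php?user=user-0&q=comparison&level=property&module=' + mymodule + '&version_id=1.0&property=' + myproperty + '&format=xml',
    type: 'POST',
    success: function(data) {
        try {
            var parsed = $.xml2json(data);
            // ... build and append the comparison rows here, as in the original callback
        } catch (e) {
            console.log('Failed to process response: ' + e);
        }
    },
    error: function(jqXHR, textStatus, errorThrown) {
        console.log('Request failed: ' + textStatus + ' ' + errorThrown);
    }
});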
