Attempting to make a summary tibble but it's not working

This is what I've put; what am I doing wrong? The error says:
Error in group_by():
! Problem adding computed columns.
Caused by error in mutate():
! Problem while computing ..2 = ... %>% dplyr:::mutate(data_1, data_summary).
Caused by error in is.factor():
! argument "object" is missing, with no default
Backtrace:
data_summary %>% ...
base::summary.default(...)
base::is.factor(object)
This is what I've used:
data_summary%>%
group_by(c(learning_time_min),
summary(
mean=mean(learning_time_min),
sd=sd(learning_time_min),
n=n(),
se=sd/sqrt(n)) %>%
dplyr:::mutate.data.frame(data_1, data_summary))
(learning_time_min is a column in my dataset full of numerical data.)
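For reference, the usual dplyr pattern is group_by() followed by summarise(); a minimal sketch, assuming data_1 is the raw data and using a placeholder grouping column called group_var (the question doesn't name one):
library(dplyr)
data_summary <- data_1 %>%
  group_by(group_var) %>%          # group_var is hypothetical; use your real grouping column
  summarise(
    mean = mean(learning_time_min, na.rm = TRUE),
    sd   = sd(learning_time_min, na.rm = TRUE),
    n    = n(),
    se   = sd / sqrt(n)
  )
The summary statistics are computed inside summarise(), not passed to group_by(), and base summary() is not needed at all.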

Related

Changing CRS in GeoPandas

I'm trying to change the CRS of a geopandas dataframe. The current CRS is:
Name: unknown
Axis Info [ellipsoidal]:
- lon[east]: Longitude (degree)
- lat[north]: Latitude (degree)
Area of Use:
- undefined
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
When I try dfTrans.to_crs('epsg:4326') I get the following error:
pyproj.exceptions.CRSError: Invalid projection: epsg:4326: (Internal Proj Error: proj_create: cannot build geodeticCRS 4326: SQLite error on SELECT name, ellipsoid_auth_name, ellipsoid_code, prime_meridian_auth_name, prime_meridian_code, area_of_use_auth_name, area_of_use_code, publication_date, deprecated FROM geodetic_datum WHERE auth_name = ? AND code = ?: no such column: publication_date)
For a simple command in pyproj, pyproj.CRS.from_epsg(4326), I get the same error:
File "pyproj/_crs.pyx", line 1738, in pyproj._crs._CRS.__init__
pyproj.exceptions.CRSError: Invalid projection: epsg:4326: (Internal Proj Error: proj_create: cannot build geodeticCRS 4326: SQLite error on SELECT name, ellipsoid_auth_name, ellipsoid_code, prime_meridian_auth_name, prime_meridian_code, area_of_use_auth_name, area_of_use_code, publication_date, deprecated FROM geodetic_datum WHERE auth_name = ? AND code = ?: no such column: publication_date)
I don't know enough to tell what's going on, but it seems like an underlying query references a column that doesn't exist. Any ideas on how to fix this or work around it?
I got the same error when using Proj 5.x. It seems that the publication_date column is a Proj 6 or Proj 7 item (both of which require SQLite).
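A minimal way to confirm this, assuming the problem really is an old PROJ build behind pyproj, is to check the version and data directory pyproj is actually using, then upgrade pyproj (recent wheels bundle their own PROJ):
import pyproj
from pyproj import datadir
print(pyproj.proj_version_str)        # should report 6.x or newer
print(datadir.get_data_dir())         # directory containing the proj.db being queried
# after upgrading, e.g. pip install --upgrade pyproj, this should succeed:
print(pyproj.CRS.from_epsg(4326).name)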

SparkR Error while writing dataframe to csv and parquet

I'm getting an error while writing a Spark DataFrame to CSV and Parquet. I have already tried installing winutils, but it still does not solve the error.
My code:
INVALID_IMEI <- c("012345678901230","000000000000000")
setwd("D:/Revas/Jatim Old")
fileList <- list.files()
cdrSchema <- structType(structField("date", "string"),
                        structField("time", "string"),
                        structField("a_number", "string"),
                        structField("b_number", "string"),
                        structField("duration", "integer"),
                        structField("lac_cid", "string"),
                        structField("imei", "string"))
file <- fileList[1]
filePath <- paste0("D:/Revas/Jatim Old/",file)
dataset <- read.df(filePath, header="false",source="csv",delimiter="|",schema=cdrSchema)
dataset <- filter(dataset, ifelse(dataset$imei %in% INVALID_IMEI,FALSE,TRUE))
dataset <- filter(dataset, ifelse(isnan(dataset$imei),FALSE,TRUE))
dataset <- filter(dataset, ifelse(isNull(dataset$imei),FALSE,TRUE))
To export the DataFrame, I tried the following code:
write.df(dataset, "D:/spark/dataset",mode="overwrite")
write.parquet(dataset, "D:/spark/dataset",mode="overwrite")
And I get the following error:
Error: Error in save : org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.comma
I already found the probable cause. The issue seems to lie in the winutils version; previously I was using 2.6. Changing it to 2.8 seems to solve the issue.
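As a sketch (the Hadoop path below is hypothetical; adjust it to your setup), the 2.8 winutils build has to be visible to Spark through HADOOP_HOME before the SparkR session starts:
# hypothetical layout: C:/hadoop-2.8/bin/winutils.exe taken from a Hadoop 2.8 build
Sys.setenv(HADOOP_HOME = "C:/hadoop-2.8")
library(SparkR)
sparkR.session()
# then retry the write from the question
write.df(dataset, "D:/spark/dataset", source = "parquet", mode = "overwrite")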

Getting AVG in Pig

I need to get the average age in each gender group...
Here is my data set:
01::F::21::0001
02::M::31::21345
03::F::22::33323
04::F::18::123
05::M::31::14567
Basically this is
userid::gender::age::occupationid
Since there is a multi-character delimiter, I read somewhere here on Stack Overflow to load it first via TextLoader():
loadusers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() as (line:chararray);
testusers = FOREACH loadusers GENERATE FLATTEN(STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
grunt> DESCRIBE testusers;
testusers: {user: int,gender: chararray,age: int,occupation: int}
grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
After running
dump average_age_of_testusers
this is the error:
2016-10-31 13:39:22,175 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR 0: Exception while executing (Name: grouped_testusers: Local Rearrange[tuple]{chararray}(false) - scope-284 Operator Key: scope-284): org.apache.pig.backend.executionengine.ExecException:
ERROR 2106: Error while computing average in Initial 2016-10-31 13:39:22,175 [main]
ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
Input(s):
Failed to read data from "/user/cloudera/test/input/users.dat"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-169204712/tmp-1755697117"
This is my first try at programming in Pig, so forgive me if the solution is very obvious.
Analyzing it further, it seems it has trouble computing the average. I thought I had made a mistake in the data type, but age is an int.
If you can help me, thank you.
I figured out the problem in this one. Please refer to "How can correct data types on Apache Pig be enforced?" for a better explanation.
But just to show what I did: I had to cast my data
testusers = FOREACH loadusers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) as (user:int, gender:chararray, age:int, occupation:int);
AVG was failing because loadusers.age was being treated as a string instead of an int.
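Putting it together, a minimal sketch of the corrected pipeline (paths and aliases taken from the question, with the cast applied before grouping):
loadusers = LOAD '/user/cloudera/test/input/users.dat' USING TextLoader() AS (line:chararray);
testusers = FOREACH loadusers GENERATE FLATTEN((tuple(int,chararray,int,int)) STRSPLIT(line,'::')) AS (user:int, gender:chararray, age:int, occupation:int);
grouped_testusers = GROUP testusers BY gender;
average_age_of_testusers = FOREACH grouped_testusers GENERATE group, AVG(testusers.age);
DUMP average_age_of_testusers;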

Remove some lines in a log file

I have a big log file.
After removing the timestamp from each line, I sort it with cat logfile | sort -u > logfile, so that the logs are clean and organized as
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
.
. (lines not shown here)
.
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
.
.
. (lines not shown here)
.
.
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
I can get the logged items (e.g. PL.HSPB in the above example) by
grep -oE " [0-9A-Z]*\.[0-9A-Z]*" logfile | sort -u
However, I also want to know the date info, and to make it clearer, I want to remove the intermediate lines. For example,
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
after removal becomes
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
i.e., for an item, only the first and last lines are kept (the digits are the year and Julian day).
Is there any shell command to do this easily?
Script:
$ cat hhz.py
#!/usr/bin/env python
import sys
import re
from collections import OrderedDict

firsts = OrderedDict()  # first line seen for each undated key
lasts = OrderedDict()   # last line seen for each undated key

for line in sys.stdin:
    line = line.rstrip("\n")
    x = re.match(r"(.*HHZ\.)[0-9][0-9][0-9][0-9]\.[0-9]+( .*)", line)
    if x is None:
        continue
    # the key is the line with the year.day stripped out,
    # so lines group by item + error message
    undated = x.group(1) + x.group(2)
    if undated not in firsts:
        firsts[undated] = line
    lasts[undated] = line

for undated in firsts:
    first = firsts[undated]
    last = lasts[undated]
    print(first)
    if first != last:
        print(last)
Input:
$ cat hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
Output:
$ hhz.py < hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else
Group lines by regexing out the date part; the undated string is the de-duplicated key.
Get the first line in each group by doing an ordered-dict put only if the key is not already set.
Get the last line in each group by doing an ordered-dict put unconditionally.
Use OrderedDict to preserve the input-file ordering (use a plain dict if you don't want that).
Check first != last to avoid printing the same line twice when there is only one line in the group.
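Since the question asked for a shell command, here is a hedged awk alternative using the same grouping idea (a sketch, not battle-tested):
awk '{
  key = $0
  sub(/[0-9][0-9][0-9][0-9]\.[0-9]+/, "", key)   # drop year.day so lines group by item + message
  if (!(key in first)) { first[key] = $0; order[++n] = key }
  last[key] = $0
}
END {
  for (i = 1; i <= n; i++) {
    k = order[i]
    print first[k]
    if (first[k] != last[k]) print last[k]
  }
}' logfile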

SparkR: "Cannot resolve column name..." when adding a new column to Spark data frame

I am trying to add some computed columns to a SparkR data frame, as follows:
Orders <- withColumn(Orders, "Ready.minus.In.mins",
(unix_timestamp(Orders$ReadyTime) - unix_timestamp(Orders$InTime)) / 60)
Orders <- withColumn(Orders, "Out.minus.In.mins",
(unix_timestamp(Orders$OutTime) - unix_timestamp(Orders$InTime)) / 60)
The first command executes ok, and head(Orders) reveals the new column. The second command throws the error:
15/12/29 05:10:02 ERROR RBackendHandler: col on 359 failed
Error in select(x, x$"*", alias(col, colName)) :
error in evaluating the argument 'col' in selecting a method for function
'select': Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Cannot resolve column name
"Ready.minus.In.mins" among (ASAP, AddressLine, BasketCount, CustomerEmail, CustomerID, CustomerName, CustomerPhone, DPOSCustomerID, DPOSOrderID, ImportedFromOldDb, InTime, IsOnlineOrder, LineItemTotal, NetTenderedAmount, OrderDate, OrderID, OutTime, Postcode, ReadyTime, SnapshotID, StoreID, Suburb, TakenBy, TenderType, TenderedAmount, TransactionStatus, TransactionType, hasLineItems, Ready.minus.In.mins);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame$$anonfun$col$1.apply(DataFrame.scala:650)
at org.apa
Do I need to do something to the data frame after adding the new column before it will accept another one?
From the link, just use backticks when accessing the column. E.g., instead of using
df['Fields.fields1']
use:
df['`Fields.fields1`']
Found it here: spark-issues mailing list archives
SparkR isn't entirely happy with "." in a column name.
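A short sketch of both workarounds in SparkR terms (column names taken from the question): either avoid "." in new column names, or backtick the dotted name whenever it appears in a SQL-style expression:
# workaround 1: use a separator SparkR resolves without trouble
Orders <- withColumn(Orders, "Ready_minus_In_mins",
                     (unix_timestamp(Orders$ReadyTime) - unix_timestamp(Orders$InTime)) / 60)
# workaround 2: keep the dotted name, but wrap it in backticks so Spark SQL
# treats it as a single identifier
head(selectExpr(Orders, "`Ready.minus.In.mins`"))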
