I have data like this in a text file.
How can I print this data in table format using pandas in a Jupyter notebook?

Just read all the data, then strip the ' characters from the column names and from each value, like so:
from io import StringIO
import pandas as pd
df_marks = pd.read_csv(StringIO("""'math','science','physics','chemistry'
df_marks.columns = df_marks.columns.str.strip("'")
df_marks = df_marks.apply(lambda x: x.str.strip("'"))
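As an alternative to stripping the quotes after loading, pandas can be told that the single quote is the quoting character. The following is a minimal sketch under that assumption; the sample rows in `raw` are hypothetical, since the real file contents are not shown in the question.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data; the real file contents are not shown in the question.
raw = (
    "'math','science','physics','chemistry'\n"
    "'80','75','68','90'\n"
    "'55','62','71','49'\n"
)

# quotechar="'" makes the parser treat single quotes as field quoting,
# so they are dropped from both the header and the values while reading.
df_marks = pd.read_csv(StringIO(raw), quotechar="'")

print(df_marks)
```

In a Jupyter notebook, leaving `df_marks` as the last expression in a cell renders it as a table.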


geopandas shape files coordinates

I'm currently trying to create geojson files from a set of shape files.
for shape_file in shape_files[1:]:
    shp = geopandas.read_file(shape_file)
    shp.to_crs(epsg = '4326')
    file_name = shape_file[0:len(shape_file) - len('.shp')] + '.geojson'
    print('Adding to JSON file')
    shp.to_file(file_name, driver = 'GeoJSON')
    print(fileName(file_name) + ' JSON file created.')
One of the problems is that the coordinates are not in the format I would like to use.
To fix this, I altered the code to change the coordinate system, but I'm now getting this error:
RuntimeError: b'no arguments in initialization list'
Any suggestions?
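The epsg value has the wrong type: if you pass epsg, it must be an int, not a string. Also note that to_crs returns a new GeoDataFrame rather than modifying shp in place, so assign the result. Your code should look like this:
shp = shp.to_crs(epsg = 4326)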
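On a side note, the output file name in the question is built by slicing off '.shp' by length. A small sketch of the same step using the standard library's pathlib follows; the helper name `geojson_name` and the example paths are hypothetical.

```python
from pathlib import Path


def geojson_name(shape_file: str) -> str:
    """Return the path of shape_file with its extension replaced by .geojson."""
    return str(Path(shape_file).with_suffix(".geojson"))


print(geojson_name("roads.shp"))
```

This avoids the manual length arithmetic and also works when the extension is not exactly four characters long.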

How can I make this script suitable for converting Excel files with more than one sheet?

import pandas as pd
from xlsx2csv import Xlsx2csv
from io import StringIO
def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()  # in-memory buffer that receives the CSV text
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)  # rewind so read_csv starts at the beginning
    df = pd.read_csv(buffer)
    return df
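Do you really need to use the xlsx2csv module? If not, you could do this with pandas alone. Note that the keyword is sheet_name (the older sheetname spelling is no longer accepted), and that passing sheet_name=None returns a dict of DataFrames for all sheets.
import pandas as pd
for sheet in ['Sheet1', 'Sheet2']:
    df = pd.read_excel('sample.xlsx', sheet_name=sheet)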

Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?
For example, I have thousands of pdf invoices and I want to read data from those and perform some analytics on that. What steps must I do to process unstructured data?
Yes, it is. Use sparkContext.binaryFiles to load files in binary format and then use map to map value to some other format - for example, parse binary with Apache Tika or Apache POI.
val rawFile = sparkContext.binaryFiles(...
val ready = ( here parsing with other framework
Importantly, the parsing itself must be done with another framework, such as the ones mentioned above. The function passed to map receives each file's path together with a stream of its contents.
We had a scenario where we needed to use a custom decryption algorithm on the input files. We didn't want to rewrite that code in Scala or Python. Python-Spark code follows:
import socket
import subprocess
from pyspark import SparkContext, SparkConf, HiveContext, AccumulatorParam

def decryptUncompressAndParseFile(filePathAndContents):
    '''each line of the file becomes an RDD record'''
    global acc_errCount, acc_errLog
    proc = subprocess.Popen(['custom_decrypt_program', '--decrypt'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (unzippedData, err) = proc.communicate(input=filePathAndContents[1])
    if len(err) > 0:  # problem reading the file
        acc_errCount.add(1)
        acc_errLog.add('Error: ' + str(err) + ' in file: ' + filePathAndContents[0] +
                       ', on host: ' + socket.gethostname() + ' return code:' + str(proc.returncode))
        return []  # this is okay with flatMap
    records = list()
    # on Python 3 the program's output is bytes; decoding it as UTF-8 is an assumption
    iterLines = iter(unzippedData.decode('utf-8').splitlines())
    for line in iterLines:
        #sys.stderr.write('Line: '+str(line)+'\n')
        values = [x.strip() for x in line.split('|')]
        records.append( (... extract data as appropriate from values into this tuple ...) )
    return records

class StringAccumulator(AccumulatorParam):
    ''' custom accumulator to hold strings '''
    def zero(self, initValue=""):
        return initValue
    def addInPlace(self, str1, str2):
        return str1.strip() + '\n' + str2.strip()

def main():
    global acc_errCount, acc_errLog
    # sc, sqlContext, args and ourSchema() are assumed to be set up elsewhere
    acc_errCount = sc.accumulator(0)
    acc_errLog = sc.accumulator('', StringAccumulator())
    binaryFileTup = sc.binaryFiles(args.inputDir)
    # use flatMap instead of map, to handle corrupt files
    linesRdd = binaryFileTup.flatMap(decryptUncompressAndParseFile, True)
    df = sqlContext.createDataFrame(linesRdd, ourSchema())
The custom string accumulator was very useful in identifying corrupt input files.
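The reason flatMap tolerates corrupt files can be illustrated without Spark. The sketch below is plain Python, not the code above: `parse_file` and the two in-memory "files" are hypothetical, a failed UTF-8 decode stands in for a failed decryption, and `itertools.chain.from_iterable` plays the role of flatMap by flattening the per-file lists.

```python
from itertools import chain


def parse_file(path_and_contents):
    """Parse one (path, bytes) pair into a list of records; [] if unreadable."""
    path, contents = path_and_contents
    try:
        text = contents.decode("utf-8")
    except UnicodeDecodeError:
        return []  # corrupt file: contributes no records
    return [
        tuple(value.strip() for value in line.split("|"))
        for line in text.splitlines()
        if line
    ]


# Hypothetical inputs: one readable file and one that cannot be decoded.
files = [
    ("good.txt", b"a | 1\nb | 2\n"),
    ("corrupt.txt", b"\xff\xfe\xfa"),
]

# Flattening the per-file lists means an empty list simply disappears.
records = list(chain.from_iterable(parse_file(f) for f in files))
print(records)
```

With map instead of a flattening step, the corrupt file would still produce one element (an empty list) that every downstream stage would have to filter out.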

Query hdfs with Spark Sql

I have a CSV file in HDFS. How can I query this file with Spark SQL? For example, I would like to run a select on specific columns and store the result back in the Hadoop Distributed File System.
You can achieve this by creating a DataFrame.
val dataFrame = spark.sparkContext
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
dataFrame.sql("<sql query>");
You should create a SparkSession. An example is here.
Load a CSV file: val df = spark.read.csv("path to your file in HDFS").
Perform your select operation: val df2 = df.select("field1", "field2").
Write the results back: df2.write.csv("path to a new file in HDFS")

How to use xpath to find a text node

I'm using Scrapy to get user information from Stack Overflow, and I tried //h2[@class="user-card-name"]/text()[1] to get the name. However, I get this:
['\n Ignacio Vazquez-Abrams\n \n
Could someone please help?
You should be able to clean up the surrounding whitespace from the result easily using Python's strip() method:
In [2]: result = response.xpath('//h2[@class="user-card-name"]/text()[1]').extract()
In [3]: [r.strip() for r in result]
Out[3]: [u'Ignacio Vazquez-Abrams']
The recommended way when crawling unstructured data with scrapy is to use ItemLoaders, and scrapylib offers some very good default_input_processor and default_output_processor.
from scrapy import Item, Field
from scrapy.loader import ItemLoader
from scrapylib.processors import default_input_processor
from scrapylib.processors import default_output_processor
class MyItem(Item):
    field1 = Field()
    field2 = Field()
class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    default_input_processor = default_input_processor
    default_output_processor = default_output_processor
Now, in your spider code, populate your items with:
from myproject.items import MyItemLoader
# ... in your callback
loader = MyItemLoader(response=response)
loader.add_xpath('field1', '//h2[@class="user-card-name"]/text()[1]')
# ... keep populating the loader
yield loader.load_item() # to return an item
Try this:
result = response.xpath('//h2[@class="user-card-name"]/text()').extract()
result = result[0].strip() if result else ''
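To see the select-then-strip idea outside a running spider, here is a small sketch that uses the standard library's ElementTree instead of Scrapy. The markup in `page` is a hypothetical, well-formed stand-in for the real page, and ElementTree supports only a subset of XPath, so the text is read from the element's `.text` attribute rather than with `text()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page markup.
page = """<div>
  <h2 class="user-card-name">
    Ignacio Vazquez-Abrams
    <span>mod</span>
  </h2>
</div>"""

root = ET.fromstring(page)
h2 = root.find(".//h2[@class='user-card-name']")

# .text holds only the text before the first child element, whitespace included,
# so strip() is still needed to get the bare name.
name = h2.text.strip() if h2 is not None and h2.text else ""
print(name)  # Ignacio Vazquez-Abrams
```

The guard mirrors the `if result else ''` check above: a missing element yields an empty string instead of an exception.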
