How to get text file data into table format using pandas in Jupyter Notebook?

I have data like this in a text file:
subject:
'math','science','physics','chemistry'
marks:
'75','62','92','90'
How do I print this data in table format using pandas in a Jupyter Notebook?

Just read all the data and strip the ' characters individually, like so:
import pandas as pd
from io import StringIO

df_marks = pd.read_csv(StringIO("""'math','science','physics','chemistry'
'75','62','92','90'"""))
df_marks.columns = df_marks.columns.str.strip("'")
df_marks = df_marks.apply(lambda x: x.str.strip("'"))
df_marks
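If the data sits in a file laid out exactly as shown in the question (a label line followed by a quoted row), a small sketch along these lines should also work; the file name marks.txt is only an assumption:
import pandas as pd

with open("marks.txt") as f:  # hypothetical file name
    lines = [line.strip() for line in f if line.strip()]

subjects = [s.strip("'") for s in lines[1].split(",")]  # row after "subject:"
marks = [m.strip("'") for m in lines[3].split(",")]     # row after "marks:"

df = pd.DataFrame([marks], columns=subjects)
df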

Related

geopandas shape files coordinates

I'm currently trying to create GeoJSON files from a set of shapefiles.
for shape_file in shape_files[1:]:
    print(fileName(shape_file))
    shp = geopandas.read_file(shape_file)
    shp.to_crs(epsg = '4326')
    file_name = shape_file[0:len(shape_file) - len('.shp')] + '.geojson'
    print(file_name)
    print('Adding to JSON file')
    shp.to_file(file_name, driver = 'GeoJSON')
    print(fileName(file_name) + ' JSON file created.')
    print()
print('DONE')
One of the problems is that the coordinates are not in the format I would like to use.
To fix this, I altered the code to change the coordinate system, but I'm now getting this error:
RuntimeError: b'no arguments in initialization list'
Any suggestions?
The type you are passing for epsg is incorrect: when you pass epsg as a keyword argument, it must be an int, not a string. So your code should look like this:
shp.to_crs(epsg=4326)
or
shp.to_crs('epsg:4326')
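For completeness, a minimal sketch of the corrected loop, assuming shape_files is the list from the question (note that to_crs returns a new, reprojected GeoDataFrame rather than changing shp in place, so the result needs to be assigned back):
import geopandas

for shape_file in shape_files[1:]:              # shape_files as in the question
    shp = geopandas.read_file(shape_file)
    shp = shp.to_crs(epsg=4326)                 # assign the reprojected GeoDataFrame back
    file_name = shape_file[:-len('.shp')] + '.geojson'
    shp.to_file(file_name, driver='GeoJSON')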

How can I make this script suitable for converting Excel files with more than one sheet inside?

import pandas as pd
from xlsx2csv import Xlsx2csv
from io import StringIO

def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()  # in-memory buffer to hold the converted CSV
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df
How can I make this script suitable for converting Excel files with more than one sheet inside? It only works for an xlsx file with one sheet at the moment...
Do you really need to use the xlsx2csv module? If not, you could try this with Pandas.
import pandas as pd

for sheet in ['Sheet1', 'Sheet2']:
    df = pd.read_excel('sample.xlsx', sheet_name=sheet)  # process df here, it is replaced on the next iteration
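If you can rely on Pandas alone, passing sheet_name=None makes read_excel return a dict of DataFrames keyed by sheet name, so the sheet list does not have to be hard-coded (sample.xlsx and the output file names are placeholders):
import pandas as pd

all_sheets = pd.read_excel('sample.xlsx', sheet_name=None)  # dict: {sheet name: DataFrame}
for name, df in all_sheets.items():
    df.to_csv(name + '.csv', index=False)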

Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark?

Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark?
For example, I have thousands of PDF invoices and I want to read data from them and perform some analytics. What steps do I need to take to process this unstructured data?
Yes, it is. Use sparkContext.binaryFiles to load the files in binary format, and then use map to transform the values into some other format - for example, parse the binary content with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(...
val ready = rawFile.map ( here parsing with other framework
What is important is that the parsing must be done with another framework, as mentioned above. The function passed to map receives the file content (as a stream) that you hand to that parser.
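A rough PySpark sketch of the same binaryFiles-then-parse pattern (the HDFS path, app name, and the parse_invoice_bytes helper are illustrative assumptions; swap in Tika, PyPDF2, or whatever parser you prefer inside the helper):
from pyspark import SparkContext

def parse_invoice_bytes(path, raw_bytes):
    # hypothetical helper: call your PDF parser here and return the extracted fields
    return {"path": path, "size": len(raw_bytes)}

sc = SparkContext(appName="pdf-parsing")
raw_files = sc.binaryFiles("hdfs:///invoices/*.pdf")  # RDD of (path, file contents as bytes)
parsed = raw_files.map(lambda kv: parse_invoice_bytes(kv[0], kv[1]))
print(parsed.take(5))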
We had a scenario where we needed to use a custom decryption algorithm on the input files. We didn't want to rewrite that code in Scala or Python. Python-Spark code follows:
import socket
import subprocess

from pyspark import SparkContext, SparkConf, AccumulatorParam
from pyspark.sql import HiveContext

def decryptUncompressAndParseFile(filePathAndContents):
    '''each line of the file becomes an RDD record'''
    global acc_errCount, acc_errLog
    proc = subprocess.Popen(['custom_decrypt_program', '--decrypt'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (unzippedData, err) = proc.communicate(input=filePathAndContents[1])
    if len(err) > 0:  # problem reading the file
        acc_errCount.add(1)
        acc_errLog.add('Error: ' + str(err) + ' in file: ' + filePathAndContents[0] +
                       ', on host: ' + socket.gethostname() + ' return code: ' + str(proc.returncode))
        return []  # this is okay with flatMap
    records = list()
    iterLines = iter(unzippedData.splitlines())
    for line in iterLines:
        # sys.stderr.write('Line: ' + str(line) + '\n')
        values = [x.strip() for x in line.split('|')]
        ...
        records.append( (... extract data as appropriate from values into this tuple ...) )
    return records

class StringAccumulator(AccumulatorParam):
    '''custom accumulator to hold strings'''
    def zero(self, initValue=""):
        return initValue
    def addInPlace(self, str1, str2):
        return str1.strip() + '\n' + str2.strip()

def main():
    ...
    global acc_errCount, acc_errLog
    acc_errCount = sc.accumulator(0)
    acc_errLog = sc.accumulator('', StringAccumulator())
    binaryFileTup = sc.binaryFiles(args.inputDir)
    # use flatMap instead of map, to handle corrupt files
    linesRdd = binaryFileTup.flatMap(decryptUncompressAndParseFile, True)
    df = sqlContext.createDataFrame(linesRdd, ourSchema())
    df.registerTempTable("dataTable")
    ...
The custom string accumulator was very useful in identifying corrupt input files.

Query HDFS with Spark SQL

I have a CSV file in HDFS. How can I query this file with Spark SQL? For example, I would like to make a select request on specific columns and store the result back to the Hadoop distributed file system.
Thanks
You can achieve this by creating a DataFrame:
case class Person(name: String, age: Int)  // schema for each CSV row

import spark.implicits._  // needed for .toDF()

val dataFrame = spark.sparkContext
  .textFile("examples/src/main/resources/people.csv")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

dataFrame.createOrReplaceTempView("people")  // register the DataFrame so SQL can see it
spark.sql("<sql query>")
You should first create a SparkSession.
Load a CSV file: val df = sparkSession.read.csv("path to your file in HDFS").
Perform your select operation: val df2 = df.select("field1", "field2").
Write the results back: df2.write.csv("path to a new file in HDFS")
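A minimal PySpark sketch of the whole flow, assuming the CSV has a header row; the paths, view name, and column names (field1, field2) are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-sql").getOrCreate()

# read the CSV from HDFS and register it as a temporary view
df = spark.read.option("header", "true").csv("hdfs:///data/people.csv")
df.createOrReplaceTempView("people")

# run the select with Spark SQL and write the result back to HDFS
result = spark.sql("SELECT field1, field2 FROM people")
result.write.mode("overwrite").csv("hdfs:///data/people_selected")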

How to use xpath to find a text node

I'm using Scrapy to get user information on Stack Overflow, and I'm trying to use //h2[@class="user-card-name"]/text()[1] to get the name. However, I get this:
['\n Ignacio Vazquez-Abrams\n \n
Can someone please help?
You should be able to clean up the surrounding whitespace from the result easily using Python's strip() method:
In [2]: result = response.xpath('//h2[@class="user-card-name"]/text()[1]').extract()
In [3]: [r.strip() for r in result]
Out[3]: [u'Ignacio Vazquez-Abrams']
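As an aside, XPath's normalize-space() should let you do the trimming inside the query itself, so no Python-side cleanup is needed (a sketch reusing the same selector, not tested here):
result = response.xpath('normalize-space(//h2[@class="user-card-name"]/text()[1])').extract_first()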
The recommended way when crawling unstructured data with Scrapy is to use ItemLoaders, and scrapylib offers a very good default_input_processor and default_output_processor.
items.py
from scrapy import Item, Field
from scrapy.loader import ItemLoader
from scrapylib.processors import default_input_processor
from scrapylib.processors import default_output_processor

class MyItem(Item):
    field1 = Field()
    field2 = Field()

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    default_input_processor = default_input_processor
    default_output_processor = default_output_processor
Now, in your spider code, populate your items with:
from myproject.items import MyItemLoader
...
...  # in your callback
loader = MyItemLoader(response=response)
loader.add_xpath('field1', '//h2[@class="user-card-name"]/text()[1]')
# ... keep populating the loader
yield loader.load_item()  # return an item
Try this:
result = response.xpath('//h2[@class="user-card-name"]/text()').extract()
result = result[0].strip() if result else ''
