Question

I am trying to read a Parquet file from an S3 bucket with PySpark in a Jupyter notebook.

from pyspark.sql import SparkSession

# key and secret_key hold the AWS credentials and are defined elsewhere.
spark = SparkSession.builder.appName("classifier").getOrCreate()

# Configure the S3A connector on the underlying Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
# Endpoint value exactly as posted.
hadoop_conf.set("fs.s3a.endpoint", "Cloud Object Storage - Amazon S3  - AWS ")
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
hadoop_conf.setInt("fs.s3a.connection.maximum", 100)
hadoop_conf.set("fs.s3a.buffer.dir", "/var/tmp/spark")

# Read the Parquet data from the bucket.
path = "s3a://s3test-dev/classifier/final_sample.parquet"
df = spark.read.parquet(path)

I get the following error message:

Illegal character in authority at index 8: https://Cloud Object Storage - Amazon S3 - AWS.

Can somebody help me understand what is going on here?

Thanks
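
A note on reading the error: the message quotes the fs.s3a.endpoint value with https:// in front of it, and "authority" is the host part of a URI, so the complaint appears to be about the endpoint string rather than the bucket path. Spaces are not allowed in a URI authority. The sketch below is a plain-Python check that can be run on the endpoint value before handing it to Spark. The function name check_s3a_endpoint is hypothetical, the host pattern is a deliberately strict assumption (letters, digits, dots, hyphens, optional port), and s3.amazonaws.com is used only as an example of a well-formed value, not as the endpoint this setup necessarily needs.

```python
import re

# Assumed, deliberately strict pattern: a host name with an optional port.
_HOST_PORT = re.compile(r"^[A-Za-z0-9.-]+(:\d+)?$")


def check_s3a_endpoint(endpoint):
    """Return a list of problems found in an endpoint string (empty if none)."""
    problems = []
    # Stray whitespace around the value is easy to miss in a notebook cell.
    if endpoint != endpoint.strip():
        problems.append("leading or trailing whitespace")
    # Drop an optional scheme so only the host[:port] part is checked.
    host = re.sub(r"^https?://", "", endpoint.strip())
    if not _HOST_PORT.match(host):
        problems.append("characters that are not valid in a host name")
    return problems


# The value from the post, then an example of a well-formed host name.
print(check_s3a_endpoint("Cloud Object Storage - Amazon S3  - AWS "))
print(check_s3a_endpoint("s3.amazonaws.com"))
```

If the check reports problems, the value passed to fs.s3a.endpoint is probably worth fixing first.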

IllegalArgumentException: java.net.URISyntaxException: While accessing s3 bucket data through pyspark
