Parquet file with json column (Multiple schemas) returns “Found duplicate column(s) in the data schema:”


I have large datasets stored as Parquet in S3 that I'm trying to query with Athena, but I'm having a problem with one JSON column.

This column holds huge JSON documents with different schemas that I'm not trying to "explode" yet; I need the raw JSON so multiple people can use it as a source. Even though I declare the column as string, Athena still interprets the JSON structure, because I had to save the column as a struct while extracting the data with Spark (its size exceeds the string limit for the Parquet file).

I’m using the following syntax to create the table:

CREATE EXTERNAL TABLE table_name(
  col1 string,
  col2 timestamp,
  ...
  
  data string) -- the problematic column
PARTITIONED BY ( 
  partition_column string) -- not the same column as the JSON one
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ('case.insensitive'= 'false') -- I tried this to mitigate the problem
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://PATH/'
TBLPROPERTIES (
  'classification'='parquet')

I understand that the problem is probably that column (and nested field) names are matched case-insensitively by default. In Spark, I managed to read the file by setting:

spark.conf.set('spark.sql.caseSensitive', True)

But the problem persists in Athena. Even when I try to use a crawler, it seems either not to work at all or to ignore the `case.insensitive` property.
