I have large datasets stored as Parquet in S3 that I'm trying to query with Athena, but I'm having a problem with one JSON column.
This column holds huge JSON documents with varying schemas that I'm not trying to "explode" yet; I need the raw JSON so multiple people can use it as a source. Even though I declare the column as string, Athena still picks up the JSON structure, because while extracting the data with Spark I had to save it as a struct: its size is bigger than the string limit for the Parquet file.
I’m using the following syntax to create the table:
CREATE EXTERNAL TABLE table_name(
col1 string,
col2 timestamp,
...
data string) -- the problematic column
PARTITIONED BY (
partition_column string) -- not the same column as the JSON one
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('case.insensitive' = 'false') -- I tried this to mitigate the problem
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://PATH/'
TBLPROPERTIES (
'classification'='parquet')
I understand that the problem is probably that, by default, column names are not case sensitive. In Spark, I managed to read the file by setting:
spark.conf.set('spark.sql.caseSensitive', True)
But the problem persists in Athena. Even when I try to use a Glue crawler, it either doesn't work or ignores the case.insensitive property.
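To illustrate what I mean by the case clash (the payload below is hypothetical, not from my data): a case-sensitive reader like Spark with spark.sql.caseSensitive=True keeps two keys that differ only by case as distinct fields, while a case-insensitive reader effectively lowercases the names, so they collide into one:

```python
import json

# Hypothetical payload: two keys differing only by case.
payload = '{"userId": 1, "userid": 2}'

# A case-sensitive reader sees two distinct fields.
keys = list(json.loads(payload).keys())
print(keys)  # ['userId', 'userid']

# A case-insensitive reader effectively lowercases field names,
# so the two distinct keys collapse into a single (ambiguous) name.
lowered = {k.lower() for k in keys}
print(len(keys), len(lowered))  # 2 1
```

This is the kind of field collision I suspect the Hive Parquet SerDe is hitting inside the struct schema that Spark wrote into the Parquet footer.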