
I'm testing some PySpark code in an EMR notebook before I deploy it and keep running into a strange error with Spark SQL. All of my tables and metadata are integrated with the AWS Glue catalog so that I can read from and write to them through Spark.

The first part of the code reads some data from S3/Glue, does some transformations and whatnot, then writes the resulting DataFrame back to S3/Glue like so:

df.repartition('datekey','coeff')\
    .write\
    .format('parquet')\
    .partitionBy('datekey','coeff')\
    .mode('overwrite')\
    .option("path", S3_PATH)\
    .saveAsTable('hive_tables.my_table')

I then try to access this table with Spark SQL, but when I run something as simple as spark.sql('select * from hive_tables.my_table where datekey=20210506').show(), it throws this:

An error was encountered:
"org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"

I've learned this happens only when specifying the datekey partition. For example, both of the following commands work fine: spark.sql('select * from hive_tables.my_table where coeff=0.5').show() and spark.sql('select * from hive_tables.my_table').show()

I've verified through Spark SQL that the partitions exist and have data in them. The datekey query also works fine through AWS Athena - just not Spark SQL.
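
A minimal sketch of that kind of check:

# List the partitions Spark SQL sees for the table (as registered in Glue)
spark.sql('show partitions hive_tables.my_table').show()

# Sanity-check that the table as a whole returns rows
spark.sql('select count(*) from hive_tables.my_table').show()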

Also, Glue definitely recognizes the two partition columns:

datekey: int
coeff: double

Any ideas here? I've tried everything I can think of and it just isn't making any sense.

  • Have you tried casting datekey and coeff to the string data type and then writing the partitioned data? (A sketch of that cast follows these comments.) May 11, 2021 at 4:16
  • @PrabhakarReddy I tried that with the datekey column, but not coeff. There's another table in this code with a coeff partition column and a string partition column that I can read and write fine.
    – hunterm
    May 11, 2021 at 15:41
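
For context, the cast suggested in the first comment would look something like this (a sketch only; per the reply above, just the datekey cast was actually tried):

from pyspark.sql import functions as F

# Cast both partition columns to string before the partitioned write
df = df.withColumn('datekey', F.col('datekey').cast('string')) \
       .withColumn('coeff', F.col('coeff').cast('string'))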

3 Answers


I had the same error in EMR 6.3.0 (Spark 3.1.1).

After upgrading to EMR 6.5.0 (Spark 3.1.2), it was resolved.
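
If you're not sure which Spark version a cluster is running, a quick check from the notebook (sketch):

# Print the Spark version bundled with the EMR release
print(spark.version)  # e.g. '3.1.1' on EMR 6.3.0, '3.1.2' on EMR 6.5.0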

  • I used ORC files. Apr 26, 2022 at 8:22

I had a similar issue in a similar environment (EMR cluster + Spark SQL + AWS Glue catalog). The query was like this:

select * 
from ufd.core_agg_data
where year <> date_format(current_timestamp, 'yyyy')

This is a table partitioned by "year", and "year" is a string. Note that "year" is used in the filter.

I got

User class threw exception: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown operator '!='

Then I "modified" the query to this one, and it worked!

select * 
from ufd.core_agg_data
where year in (select date_format(current_timestamp, 'yyyy'))
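
As the comment below points out, this rewrite inverts the filter, so it only matches the current year. To keep the original not-equal semantics, presumably the negated form would be needed (untested sketch):

-- untested: negated subquery to preserve the original not-equal semantics
select *
from ufd.core_agg_data
where year not in (select date_format(current_timestamp, 'yyyy'))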
  • Shouldn't it be not in to obtain the same result as the first query?
    – meniluca
    Nov 16, 2023 at 15:51

I would still like a straightforward solution to this, but currently this workaround suffices:

I first read the table straight from the S3 path:

temp_df = spark.read.parquet(S3_PATH)

so that it doesn't use the Glue catalog for the metadata. Then I create a global temp view (scoped to the Spark application rather than to a single session):

temp_df.createGlobalTempView('my_table')

which allows me to query it using Spark SQL with the global_temp database:

spark.sql('select * from global_temp.my_table where datekey=20210506').show()

and this works.
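
One caveat if this runs in a notebook: createGlobalTempView throws an error if the view already exists, so the re-runnable variant would be (sketch):

# Replaces the view if it already exists, so the cell can be re-run safely
temp_df.createOrReplaceGlobalTempView('my_table')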

  • Interesting, I was just talking about Hive and S3 and such. Can you show all your code? Also, I doubt the notebook has anything to do with it. May 11, 2021 at 17:44
  • @thebluephantom The full code is like 600 lines of PySpark, which unfortunately I didn't even write. I'm just working on transferring this already-working Spark application to run in the cloud instead of on prem.
    – hunterm
    May 11, 2021 at 17:57
  • OK, something I'm taking note of, as I was looking at the Glue Catalog (as an architect) today as well. May 11, 2021 at 17:59
  • @thebluephantom And yeah, I don't think the notebook has anything to do with it, but doing it on EMR with the Glue catalog definitely does, since this code has been running successfully for a while now off the cloud.
    – hunterm
    May 11, 2021 at 18:03
