
I'm testing some PySpark code in an EMR notebook before I deploy it and keep running into a strange error with Spark SQL. All of my tables and metadata are integrated with the AWS Glue catalog so that I can read from and write to them through Spark.

The first part of the code reads some data from S3/Glue, does some transformations and whatnot, then writes the resulting DataFrame back to S3/Glue like so:

df.repartition('datekey','coeff')\
    .write\
    .format('parquet')\
    .partitionBy('datekey','coeff')\
    .mode('overwrite')\
    .option("path", S3_PATH)\
    .saveAsTable('hive_tables.my_table')

I then try to access this table with Spark SQL, but when I run something as simple as spark.sql('select * from hive_tables.my_table where datekey=20210506').show(), it throws this:

An error was encountered:
"org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"

I've learned this happens only when specifying the datekey partition. For example, both of the following commands work fine: spark.sql('select * from hive_tables.my_table where coeff=0.5').show() and spark.sql('select * from hive_tables.my_table').show()

I've verified through Spark SQL that the partitions exist and have data in them. The datekey query also works fine through AWS Athena - just not Spark SQL.
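
A minimal sketch of that kind of check:

# List the partitions Spark SQL sees for the table (as registered in Glue)
spark.sql('show partitions hive_tables.my_table').show()

# Sanity-check that the table as a whole returns rows
spark.sql('select count(*) from hive_tables.my_table').show()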

Also, Glue definitely recognizes the two partition columns:

datekey: int
coeff: double

Any ideas here? I've tried everything I can think of and it just isn't making any sense.

  • Have you tried casting datekey and coeff to the string data type and then writing the partitioned data? (A sketch of that cast follows these comments.) May 11, 2021 at 4:16
  • @PrabhakarReddy I tried that with the datekey column, but not coeff. There's another table in this code with a coeff partition column and a string partition column that I can read and write fine.
    – hunterm
    May 11, 2021 at 15:41
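
For context, the cast suggested in the first comment would look something like this (a sketch only; per the reply above, just the datekey cast was actually tried):

from pyspark.sql import functions as F

# Cast both partition columns to string before the partitioned write
df = df.withColumn('datekey', F.col('datekey').cast('string')) \
       .withColumn('coeff', F.col('coeff').cast('string'))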

3 Answers


I had the same error in EMR 6.3.0 (Spark 3.1.1).

After upgrading to EMR 6.5.0 (Spark 3.1.2), it was resolved.
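
If you're not sure which Spark version a cluster is running, a quick check from the notebook (sketch):

# Print the Spark version bundled with the EMR release
print(spark.version)  # e.g. '3.1.1' on EMR 6.3.0, '3.1.2' on EMR 6.5.0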

  • I used ORC files. Apr 26, 2022 at 8:22

I had a similar issue in a similar environment (EMR cluster + Spark SQL + AWS Glue catalog). The query was like this:

select * 
from ufd.core_agg_data
where year <> date_format(current_timestamp, 'yyyy')

This is a table partitioned by "year", and "year" is a string. Note that "year" is used in the filter.

I got

User class threw exception: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown operator '!='

Then I "modified" the query to this one, and it worked!

select * 
from ufd.core_agg_data
where year in (select date_format(current_timestamp, 'yyyy'))
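
As the comment below points out, this rewrite inverts the filter, so it only matches the current year. To keep the original not-equal semantics, presumably the negated form would be needed (untested sketch):

-- untested: negated subquery to preserve the original not-equal semantics
select *
from ufd.core_agg_data
where year not in (select date_format(current_timestamp, 'yyyy'))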
  • Shouldn't it be not in to obtain the same result as the first query?
    – meniluca
    Nov 16, 2023 at 15:51

I would still like a straightforward solution to this, but currently this workaround suffices:

I first read the table straight from the S3 path:

temp_df = spark.read.parquet(S3_PATH)

so that it doesn't use the Glue catalog for the metadata. Then I create a global temp view (scoped to the Spark application rather than to a single session):

temp_df.createGlobalTempView('my_table')

which allows me to query it using Spark SQL with the global_temp database:

spark.sql('select * from global_temp.my_table where datekey=20210506').show()

and this works.
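
One caveat if this runs in a notebook: createGlobalTempView throws an error if the view already exists, so the re-runnable variant would be (sketch):

# Replaces the view if it already exists, so the cell can be re-run safely
temp_df.createOrReplaceGlobalTempView('my_table')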

  • Interesting, I was just talking about Hive and S3 and such. Can you show all your code? Also, I doubt the notebook has anything to do with it. May 11, 2021 at 17:44
  • @thebluephantom The full code is like 600 lines of PySpark, which unfortunately I didn't even write. I'm just working on transferring this already-working Spark application to run in the cloud instead of on prem.
    – hunterm
    May 11, 2021 at 17:57
  • OK, something I'm taking note of, as I was looking at the Glue Catalog (as an architect) today as well. May 11, 2021 at 17:59
  • @thebluephantom And yeah, I don't think the notebook has anything to do with it, but doing it on EMR with the Glue catalog definitely does, since this code has been running successfully for a while now off the cloud.
    – hunterm
    May 11, 2021 at 18:03
