I am running a Glue ETL transformation job. This job is supposed to read data from S3 and convert it to Parquet.
Below is the Glue source; sourcePath is the location of the S3 files.
In this location we have around 100 million JSON files, all of them nested into sub-folders.
That is why I am applying the exclusionPattern to exclude any files not starting with a, and I believe that only the files starting with a (which are around 2.7 million files) will be processed.
val file_paths = Array(sourcePath)
val exclusionPattern = "\"" + sourcePath + "{[!a]}**" + "\""

glueContext
  .getSourceWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map(
      "paths" -> file_paths,
      "recurse" -> true,
      "groupFiles" -> "inPartition",
      "exclusions" -> s"[$exclusionPattern]"
    )),
    format = "json",
    transformationContext = "sourceDF"
  )
  .getDynamicFrame()
  .map(transformRow, "error in row")
  .toDF()
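
For reference, this is what the exclusions option evaluates to after the string concatenation above; a minimal sketch, assuming sourcePath is the bucket passed in --SOURCE_DATA_LOCATION below (s3://xyz/):

// Minimal sketch: evaluating the strings above with a hypothetical
// sourcePath of "s3://xyz/" (assumed to come from --SOURCE_DATA_LOCATION)
val sourcePath = "s3://xyz/"
val exclusionPattern = "\"" + sourcePath + "{[!a]}**" + "\""
// exclusionPattern now contains: "s3://xyz/{[!a]}**"   (including the quotes)
val exclusions = s"[$exclusionPattern]"
// exclusions now contains: ["s3://xyz/{[!a]}**"]
// i.e. the glob is intended to exclude every key that does not start with "a"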
After running this job with the Standard worker type and with the G2 worker type as well, I keep getting the error:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 27788"...
In CloudWatch I can see that driver memory utilization reaches 100%, while executor memory usage is almost nil.
When running the job I am setting spark.driver.memory=10g and spark.driver.memoryOverhead=4096 via the --conf job parameter.
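
Roughly, that job parameter looks like this (a sketch of the key/value layout; the exact form is an assumption, and the effective values appear in the log below):

Key:   --conf
Value: spark.driver.memory=10g --conf spark.driver.memoryOverhead=4096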
These are the details from the logs:
--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000
--conf spark.hadoop.fs.defaultFS=hdfs://ip-myip.compute.internal:1111
--conf spark.hadoop.yarn.resourcemanager.address=ip-myip.compute.internal:1111
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.minExecutors=1
--conf spark.dynamicAllocation.maxExecutors=4
--conf spark.executor.memory=20g
--conf spark.executor.cores=16
--conf spark.driver.memory=20g
--conf spark.default.parallelism=80
--conf spark.sql.shuffle.partitions=80
--conf spark.network.timeout=600
--job-bookmark-option job-bookmark-disable
--TempDir s3://my-location/admin
--class com.example.ETLJob
--enable-spark-ui true
--enable-metrics
--JOB_ID j_111...
--spark-event-logs-path s3://spark-ui
--conf spark.driver.memory=20g
--JOB_RUN_ID jr_111...
--conf spark.driver.memoryOverhead=4096
--scriptLocation s3://my-location/admin/Job/ETL
--SOURCE_DATA_LOCATION s3://xyz/
--job-language scala
--DESTINATION_DATA_LOCATION s3://xyz123/
--JOB_NAME ETL
Any ideas what could be the issue?
Thanks