Questions tagged [aws-glue]
AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.
4,823
questions
47
votes
8
answers
37k
views
AWS Glue Crawler Not Creating Table
I have a crawler I created in AWS Glue that does not create a table in the Data Catalog after it successfully completes.
The crawler takes roughly 20 seconds to run and the logs show it successfully ...
43
votes
7
answers
69k
views
How do I write messages to the output log on AWS Glue?
AWS Glue jobs log output and errors to two different CloudWatch logs, /aws-glue/jobs/error and /aws-glue/jobs/output by default. When I include print() statements in my scripts for debugging, they get ...
43
votes
5
answers
18k
views
How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')
As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table ...
43
votes
9
answers
33k
views
Can I test AWS Glue code locally?
After reading Amazon docs, my understanding is that the only way to run/test a Glue script is to deploy it to a dev endpoint and debug remotely if necessary. At the same time, if the (Python) code ...
42
votes
4
answers
48k
views
DynamicFrame vs DataFrame
What is the difference? I know that DynamicFrame was created for AWS Glue, but AWS Glue also supports DataFrame. When should DynamicFrame be used in AWS Glue?
35
votes
5
answers
32k
views
Is AWS Lambda preferred over AWS Glue Job?
In an AWS Glue job, we can write a script and execute it via the job.
In AWS Lambda, too, we can write the same script and execute the same logic as in the job above.
So, my query is not what's ...
34
votes
3
answers
23k
views
What is transformation_ctx used for in aws glue?
Many methods in the API accept this parameter, with a default value of "".
Is it just a string marker? If so, what is its purpose?
31
votes
6
answers
36k
views
AWS Glue to Redshift: Is it possible to replace, update or delete data?
Here are some bullet points in terms of how I have things set up:
I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema.
I have a Glue job set up that writes the data ...
27
votes
6
answers
21k
views
Could not find S3 endpoint or NAT gateway for subnetId
I am unable to connect AWS Glue with RDS
VPC S3 endpoint validation failed for SubnetId: subnet-7e8a2. VPC: vpc-4d2d25.
Reason: Could not find S3 endpoint or NAT gateway for subnetId: subnet-7ea32 in ...
26
votes
3
answers
33k
views
Overwrite parquet files from dynamic frame in AWS Glue
I use dynamic frames to write a Parquet file to S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this:
glueContext.write_dynamic_frame....
26
votes
2
answers
60k
views
AWS Glue Job Input Parameters
I am relatively new to AWS and this may be a bit less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading in a series of tables that each ...
26
votes
3
answers
21k
views
What actions does job.commit perform in aws glue?
Every job script should end with job.commit(), but what exactly does this function do?
Is it just a job end marker or not?
Can it be called twice during one job (if yes - in what cases)?
Is it ...
25
votes
4
answers
43k
views
At least one security group must open all ingress ports. AWS Glue connecting to RDS
I am just starting out with AWS Glue and I am trying to connect it to my publicly accessible MySQL database hosted on RDS Aurora to get its data.
So I start by creating a crawler and in the data ...
25
votes
5
answers
36k
views
AWS Glue: How to handle nested JSON with varying schemas
Objective:
We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.
Background:
The ...
23
votes
6
answers
26k
views
Can we consider AWS Glue as a replacement for EMR?
Just a quick question for the experts: since AWS Glue, as an ETL tool, can provide companies with benefits such as minimal or no server maintenance, cost savings by avoiding over-provisioning ...
22
votes
6
answers
45k
views
AWS Glue executor memory limit
I found that AWS Glue sets up executor instances with a memory limit of 5 GB (--conf spark.executor.memory=5g), and sometimes, on big datasets, it fails with java.lang.OutOfMemoryError. The same goes for ...
22
votes
4
answers
16k
views
AWS Glue pricing against AWS EMR
I am doing a pricing comparison between AWS Glue and AWS EMR in order to choose between them.
I have considered 6 DPUs (4 vCPUs + 16 GB Memory) with ETL Job running for 10 minutes for ...
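The arithmetic behind a comparison like this can be sketched as DPUs × hours × hourly rate. The function name glue_job_cost and the 0.44 USD per DPU-hour rate below are assumptions for illustration only: the rate is not taken from the question, varies by region and over time, and the sketch ignores any minimum billing duration.

```python
def glue_job_cost(dpus: int, minutes: float, usd_per_dpu_hour: float) -> float:
    """Estimate the cost of one job run as DPU-hours times the hourly rate."""
    # DPU-hours consumed by the run (no minimum billing duration applied).
    dpu_hours = dpus * minutes / 60.0
    return dpu_hours * usd_per_dpu_hour


# 6 DPUs for 10 minutes is exactly 1 DPU-hour, so the cost equals the rate.
# The 0.44 rate is an assumed example value, not a quoted price.
cost = glue_job_cost(dpus=6, minutes=10, usd_per_dpu_hour=0.44)
print(f"Estimated cost per run: ${cost:.2f}")
```

Keeping the rate as a parameter makes it easy to rerun the comparison against whatever the current pricing page states.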
22
votes
1
answer
26k
views
Spark dynamic frame show method yields nothing
So I am using AWS Glue auto-generated code to read a CSV file from S3 and write it to a table over a JDBC connection. It seems simple: the job runs successfully with no errors, but it writes nothing. When I ...
21
votes
8
answers
26k
views
Optional job parameter in AWS Glue?
How can I implement an optional parameter for an AWS Glue Job?
I have created a job that currently has a string parameter (an ISO 8601 date string) as an input that is used in the ETL job. I would ...
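One possible approach, sketched with only the standard library: check whether the argument is present before reading it. This assumes the job receives its parameters as --name value pairs in sys.argv. The helper get_optional_arg and the parameter name process_date are hypothetical and not part of any Glue API.

```python
from typing import Optional, Sequence


def get_optional_arg(
    argv: Sequence[str], name: str, default: Optional[str] = None
) -> Optional[str]:
    """Return the value supplied for --name in argv, or default if it is absent."""
    flag = f"--{name}"
    for i, token in enumerate(argv):
        # Form 1: "--name value" as two separate tokens.
        if token == flag and i + 1 < len(argv):
            return argv[i + 1]
        # Form 2: "--name=value" as a single token.
        if token.startswith(flag + "="):
            return token.split("=", 1)[1]
    return default


# Hypothetical usage inside a job script (process_date is an illustrative name):
#   import sys
#   process_date = get_optional_arg(sys.argv, "process_date", default=None)
```

Required parameters can still be resolved the usual way; only the optional ones need this kind of presence check.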
21
votes
4
answers
38k
views
Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
When switching from Glue 2.0 to 3.0, which means also switching from Spark 2.4 to 3.1.1,
my jobs start to fail when processing timestamps prior to 1900 with this error:
An error occurred while calling ...
21
votes
1
answer
16k
views
AWS Athena concurrency limits: Number of submitted queries VS number of running queries
According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my ...
20
votes
4
answers
23k
views
Add a partition on glue table via API on AWS?
I have an S3 bucket that is constantly being filled with new data, and I am using Athena and Glue to query that data. The problem is that if Glue doesn't know that a new partition has been created, it doesn't search ...
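A minimal sketch of how a partition registration payload might be assembled, assuming the new partition shares the table's storage format and differs only in its S3 location. The helper build_partition_input, the sample descriptor, and the names my_db and my_table are hypothetical. The boto3 call appears only in a comment so the block runs without AWS access.

```python
import copy
from typing import Any, Dict, List


def build_partition_input(
    table_storage_descriptor: Dict[str, Any], values: List[Any], location: str
) -> Dict[str, Any]:
    """Build a PartitionInput-shaped dict from a table's storage descriptor."""
    # Copy so the table's own descriptor is left unmodified.
    descriptor = copy.deepcopy(table_storage_descriptor)
    descriptor["Location"] = location
    # Partition values are passed as strings, in partition-key order.
    return {"Values": [str(v) for v in values], "StorageDescriptor": descriptor}


# Illustrative descriptor; in practice it would come from the table definition.
table_sd = {"Location": "s3://my-bucket/data/", "Columns": [{"Name": "id", "Type": "string"}]}
partition_input = build_partition_input(table_sd, [2017, 11], "s3://my-bucket/data/year=2017/month=11/")

# With boto3 available, the payload could then be submitted along these lines:
#   glue = boto3.client("glue")
#   glue.create_partition(DatabaseName="my_db", TableName="my_table",
#                         PartitionInput=partition_input)
```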
20
votes
2
answers
29k
views
AWS Glue issue with double quote and commas
I have this CSV file:
reference,address
V7T452F4H9,"12410 W 62TH ST, AA D"
The following options are being used in the table definition
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2....
19
votes
3
answers
59k
views
convert spark dataframe to aws glue dynamic frame
I tried converting my Spark DataFrames to DynamicFrames to output as glueparquet files, but I'm getting the error
'DataFrame' object has no attribute 'fromDF'
My code relies heavily on Spark DataFrames. Is ...
19
votes
3
answers
27k
views
How to use extra files for AWS glue job
I have an ETL job written in Python, which consists of multiple scripts with the following directory structure:
my_etl_job
|
|--services
| |
| |-- __init__.py
| |-- dynamoDB_service.py
|
|-- ...
17
votes
4
answers
22k
views
How to list all databases and tables in AWS Glue Catalog?
I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console.
How can I access the catalog and list all databases and tables? ...
16
votes
9
answers
22k
views
AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3
Part One :
I tried running a Glue crawler on a dummy CSV file loaded in S3. It created a table, but when I try to view the table in Athena and query it, it shows zero records returned.
But the demo data of ELB in ...
16
votes
13
answers
43k
views
Use AWS Glue Python with NumPy and Pandas Python Packages
What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a completed script within Python I would like to run in AWS Glue that utilizes NumPy ...
16
votes
5
answers
26k
views
How to Convert Many CSV files to Parquet using AWS Glue
I'm using AWS S3, Glue, and Athena with the following setup:
S3 --> Glue --> Athena
My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data.
Since ...
16
votes
1
answer
10k
views
How do I set multiple --conf table parameters in AWS Glue?
Multiple answers on Stack Overflow for AWS Glue say to set the --conf table parameter. However, sometimes we'll need to set multiple --conf key-value pairs in one job.
I've tried the following ...
16
votes
4
answers
34k
views
How can I use an external python library in AWS Glue?
First stack overflow question here. Hope I do this correctly:
I need to use an external python library in AWS glue. "Openpyxl" is the name of the library.
I follow these directions: https://docs.aws....
16
votes
2
answers
8k
views
AWS Glue vs EMR Serverless
Recently, AWS announced Amazon EMR Serverless (Preview) https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/ - new very ...
15
votes
3
answers
15k
views
AWS Glue takes a long time to finish
I just ran a very simple job, as follows:
glueContext = GlueContext(SparkContext.getOrCreate())
l_table = glueContext.create_dynamic_frame.from_catalog(
database="gluecatalog",
...
15
votes
4
answers
19k
views
AWS Glue cannot create database from crawler: permission denied
I am trying to use an AWS Glue crawler on an S3 bucket to populate a Glue database. I run the Create Crawler wizard, select my datasource (the S3 bucket with the avro files), have it create the IAM ...
15
votes
2
answers
22k
views
AWS Glue output file name
I am using AWS to transform some JSON files. I have added the files to Glue from S3. The job I have set up reads the files in ok, the job runs successfully, there is a file added to the correct S3 ...
15
votes
6
answers
10k
views
AWS Glue Crawler adding tables for every partition?
I have several thousand files in an S3 bucket in this form:
├── bucket
│ ├── somedata
│ │ ├── year=2016
│ │ ├── year=2017
│ │ │ ├── month=11
│ │ | │ ├── sometype-2017-11-01....
15
votes
5
answers
12k
views
How to set the name for a crawled table?
The AWS crawler has a prefix property for adding new tables. So if I leave the prefix empty and point the crawler at s3://my-bucket/some-table-backup, it creates a table named some-table-backup. Is there a way to ...
15
votes
1
answer
3k
views
Exception with Table identified via AWS Glue Crawler and stored in Data Catalog
I'm working on building the company's new data lake and am trying to find the best and most recent options to work with.
So, I found a pretty nice solution using EMR + S3 + Athena + Glue.
...
15
votes
0
answers
10k
views
Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR
I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. As per the guidelines provided in the official AWS documentation (reference link below), I ...
14
votes
4
answers
26k
views
AWS Glue job consuming data from external REST API
I'm trying to create a workflow where an AWS Glue ETL job pulls JSON data from an external REST API instead of S3 or any other AWS-internal source.
Is that even possible? Has anyone done it?
Please ...
14
votes
4
answers
18k
views
How to overcome Spark "No Space left on the device" error in AWS Glue Job
I used an AWS Glue job with PySpark to read data from S3 Parquet files totaling more than 10 TB, but the job failed during execution of the Spark SQL query with the error
...
14
votes
2
answers
15k
views
AWS Athena - GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters
I'm querying a table in Athena that is giving the error: GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters
I was able to query it earlier, but added another ...
14
votes
1
answer
2k
views
using AWS Glue with Apache Avro on schema changes
I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but am struggling through the following use case:
We have an s3 bucket with a number of Avro files. We have decided to ...
14
votes
1
answer
2k
views
Glue Dynamic Frame is way slower than regular Spark
In the image below we have the same glue job run with three different configurations in terms of how we write to S3:
We used a dynamic frame to write to S3
We used a pure spark frame to write to S3
...
13
votes
3
answers
14k
views
glue job for redshift connection: "Unable to find suitable security group"
I'm trying to set up an AWS Glue job and make a connection to Redshift.
I'm getting an error when I set the connection type to Redshift:
"Unable to find a suitable security group. Change connection ...
13
votes
4
answers
23k
views
Event-based trigger of an AWS Glue crawler after a file is uploaded to an S3 bucket?
Is it possible to trigger an AWS Glue crawler on new files that are uploaded to an S3 bucket, given that the crawler is "pointed" to that bucket? In other words: a file upload generates an event, ...
13
votes
2
answers
22k
views
How to solve this HIVE_PARTITION_SCHEMA_MISMATCH?
I have partitioned data in CSV files on S3:
s3://bucket/dataset/p=1/*.csv (partition #1)
...
s3://bucket/dataset/p=100/*.csv (partition #100)
I run a classifier over s3://bucket/dataset/ and the ...
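The p=1 through p=100 prefixes above follow the Hive-style key=value partition layout. As a rough sketch of how such a path maps to partition values, the helper below extracts them from an S3 URI. The name partition_values and the file name part-0000.csv are made up for illustration.

```python
from typing import Dict
from urllib.parse import urlparse


def partition_values(s3_uri: str) -> Dict[str, str]:
    """Extract Hive-style key=value partition segments from an S3 URI."""
    path = urlparse(s3_uri).path
    segments = path.strip("/").split("/")
    # If the URI points at an object, drop the file name so that only
    # directory-like prefixes are considered.
    if not path.endswith("/"):
        segments = segments[:-1]
    values: Dict[str, str] = {}
    for segment in segments:
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


# part-0000.csv is a hypothetical object name under partition #1.
print(partition_values("s3://bucket/dataset/p=1/part-0000.csv"))
```

Every object under the same table is expected to expose the same partition keys, which is one reason to inspect the paths before looking at per-partition schemas.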