Highest scored 'aws-glue' questions

47 votes

8 answers

37k views

AWS Glue Crawler Not Creating Table

I have a crawler I created in AWS Glue that does not create a table in the Data Catalog after it successfully completes. The crawler takes roughly 20 seconds to run and the logs show it successfully ...

Vince

611

asked Nov 1, 2017 at 17:02

43 votes

7 answers

69k views

How do I write messages to the output log on AWS Glue?

AWS Glue jobs log output and errors to two different CloudWatch logs, /aws-glue/jobs/error and /aws-glue/jobs/output by default. When I include print() statements in my scripts for debugging, they get ...

Jesse Clark

1,190

asked Feb 21, 2018 at 19:51

43 votes

5 answers

18k views

How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table ...

rjmurt

1,195

asked Sep 15, 2017 at 13:44

43 votes

9 answers

33k views

Can I test AWS Glue code locally?

After reading Amazon docs, my understanding is that the only way to run/test a Glue script is to deploy it to a dev endpoint and debug remotely if necessary. At the same time, if the (Python) code ...

lfk

2,523

asked Jan 18, 2018 at 5:15

42 votes

4 answers

48k views

DynamicFrame vs DataFrame

What is the difference? I know that DynamicFrame was created for AWS Glue, but AWS Glue also supports DataFrame. When should DynamicFrame be used in AWS Glue?

Alex Oh

461

asked Oct 15, 2018 at 18:12

35 votes

5 answers

32k views

Is AWS Lambda preferred over AWS Glue Job?

In an AWS Glue job, we can write a script and execute it via the job. In AWS Lambda, too, we can write the same script and execute the same logic as in the job above. So, my query is not what's ...

john

1,045

asked Aug 26, 2020 at 14:29

34 votes

3 answers

23k views

What is transformation_ctx used for in aws glue?

A lot of methods in the API accept this parameter with a default value of "". Is it just a string marker? If so, what is its purpose?

Cherry

32.4k

asked Jan 17, 2018 at 12:02

31 votes

6 answers

36k views

AWS Glue to Redshift: Is it possible to replace, update or delete data?

Here are some bullet points in terms of how I have things set up: I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema. I have a Glue job set up that writes the data ...

krchun

1,014

asked Sep 14, 2017 at 21:08

27 votes

6 answers

21k views

Could not find S3 endpoint or NAT gateway for subnetId

I am unable to connect AWS Glue with RDS. VPC S3 endpoint validation failed for SubnetId: subnet-7e8a2. VPC: vpc-4d2d25. Reason: Could not find S3 endpoint or NAT gateway for subnetId: subnet-7ea32 in ...

user11448446

271

asked May 3, 2019 at 15:25

26 votes

3 answers

33k views

Overwrite parquet files from dynamic frame in AWS Glue

I use dynamic frames to write a parquet file in S3, but if a file already exists my program appends a new file instead of replacing it. The statement that I use is this: glueContext.write_dynamic_frame....

Mateo Rod

594

asked Aug 24, 2018 at 9:47

26 votes

2 answers

60k views

AWS Glue Job Input Parameters

I am relatively new to AWS and this may be a somewhat less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading in a series of tables that each ...

Sauron

6,519

asked Sep 13, 2018 at 15:08

26 votes

3 answers

21k views

What actions does job.commit perform in aws glue?

Every job script should end with job.commit(), but what exactly does this function do? Is it just a job-end marker or not? Can it be called twice during one job (if yes, in what cases)? Is it ...

Cherry

32.4k

asked Jan 14, 2018 at 8:35

25 votes

4 answers

43k views

At least one security group must open all ingress ports. AWS Glue connecting to RDS

I am still starting out with AWS Glue and I am trying to connect it to my publicly accessible MySQL database hosted on RDS Aurora to get its data. So I start by creating a crawler and in the data ...

Naguib Ihab

4,378

asked Jul 17, 2018 at 6:10

25 votes

5 answers

36k views

AWS Glue: How to handle nested JSON with varying schemas

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum. Background: The ...

ehelander

253

asked Mar 23, 2018 at 21:09

23 votes

6 answers

26k views

Can we consider AWS Glue as a replacement for EMR?

Just a quick question to clarify with the masters: since AWS Glue, as an ETL tool, can provide companies with benefits such as minimal or no server maintenance, cost savings by avoiding over-provisioning ...

Yuva

2,999

asked Jan 12, 2018 at 9:09

22 votes

6 answers

45k views

AWS Glue executor memory limit

I found that AWS Glue sets up the executor instance with a memory limit of 5 GB (--conf spark.executor.memory=5g), and sometimes, on big datasets, it fails with java.lang.OutOfMemoryError. The same is for ...

Alexey Bakulin

1,289

asked Feb 28, 2018 at 16:21

22 votes

4 answers

16k views

AWS Glue pricing against AWS EMR

I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between EMR & Glue. I have considered 6 DPUs (4 vCPUs + 16 GB Memory) with ETL Job running for 10 minutes for ...

Yuva

2,999

asked Feb 7, 2018 at 11:32

22 votes

1 answer

26k views

Spark dynamic frame show method yields nothing

So I am using AWS Glue auto-generated code to read a CSV file from S3 and write it to a table over a JDBC connection. It seems simple: the job runs successfully with no error, but it writes nothing. When I ...

PyRaider

639

asked May 6, 2019 at 22:51

21 votes

8 answers

26k views

Optional job parameter in AWS Glue?

How can I implement an optional parameter to an AWS Glue Job? I have created a job that currently has a string parameter (an ISO 8601 date string) as an input that is used in the ETL job. I would ...

matsev

33.1k

asked Sep 4, 2018 at 8:27

21 votes

4 answers

38k views

Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0

When switching from Glue 2.0 to 3.0, which means also switching from Spark 2.4 to 3.1.1, my jobs start to fail when processing timestamps prior to 1900 with this error: An error occurred while calling ...

Robert Kossendey

6,857

asked Aug 23, 2021 at 10:51

21 votes

1 answer

16k views

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my ...

Ilya Kisil

2,568

asked Jul 22, 2019 at 12:22

20 votes

4 answers

23k views

Add a partition on glue table via API on AWS?

I have an S3 bucket which is constantly being filled with new data, and I am using Athena and Glue to query that data. The thing is, if Glue doesn't know that a new partition has been created, it doesn't search ...

Gudzo

639

asked Jun 1, 2018 at 8:08

20 votes

2 answers

29k views

AWS Glue issue with double quote and commas

I have this CSV file: reference,address V7T452F4H9,"12410 W 62TH ST, AA D" The following options are being used in the table definition ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2....

ln9187

740

asked May 15, 2018 at 15:35

19 votes

3 answers

59k views

convert spark dataframe to aws glue dynamic frame

I tried converting my Spark DataFrames to DynamicFrames to output as glueparquet files, but I'm getting the error "'DataFrame' object has no attribute 'fromDF'". My code uses Spark DataFrames heavily. Is ...

user3476463

4,335

asked Nov 24, 2019 at 4:25

19 votes

3 answers

27k views

How to use extra files for AWS glue job

Anum Sheraz

2,549

asked Apr 14, 2020 at 21:50

17 votes

4 answers

22k views

How to list all databases and tables in AWS Glue Catalog?

I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console. How can I access the catalog and list all databases and tables? ...

Jiří Mauritz

431

asked Sep 6, 2017 at 16:45

16 votes

9 answers

22k views

AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3

Part One: I ran a Glue crawler on a dummy CSV loaded in S3, and it created a table, but when I try to view the table in Athena and query it, it shows zero records returned. But the demo data of ELB in ...

Kush Vyas

5,939

asked Nov 13, 2017 at 14:41

16 votes

13 answers

43k views

Use AWS Glue Python with NumPy and Pandas Python Packages

What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a completed script within Python I would like to run in AWS Glue that utilizes NumPy ...

jumpman23

385

asked Sep 20, 2017 at 18:42

16 votes

5 answers

26k views

How to Convert Many CSV files to Parquet using AWS Glue

I'm using AWS S3, Glue, and Athena with the following setup: S3 --> Glue --> Athena My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data. Since ...

mark s.

656

asked Apr 23, 2018 at 16:54

16 votes

1 answer

10k views

How do I set multiple --conf table parameters in AWS Glue?

Multiple answers on Stack Overflow for AWS Glue say to set the --conf table parameter. However, sometimes we'll need to set multiple --conf key-value pairs in one job. I've tried the following ...

Zambonilli

4,489

asked Apr 4, 2019 at 19:36

16 votes

4 answers

34k views

How can I use an external python library in AWS Glue?

First stack overflow question here. Hope I do this correctly: I need to use an external python library in AWS glue. "Openpyxl" is the name of the library. I follow these directions: https://docs.aws....

Marlon Holland

171

asked Oct 2, 2019 at 16:55

16 votes

2 answers

8k views

AWS Glue vs EMR Serverless

Recently, AWS announced Amazon EMR Serverless (Preview) https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/ - new very ...

alexanoid

25k

asked Dec 12, 2021 at 8:10

15 votes

3 answers

15k views

AWS Glue takes a long time to finish

I just run a very simple job as follows glueContext = GlueContext(SparkContext.getOrCreate()) l_table = glueContext.create_dynamic_frame.from_catalog( database="gluecatalog", ...

Shawn

5,200

asked Aug 29, 2017 at 19:36

15 votes

4 answers

19k views

AWS Glue cannot create database from crawler: permission denied

I am trying to use an AWS Glue crawler on an S3 bucket to populate a Glue database. I run the Create Crawler wizard, select my datasource (the S3 bucket with the avro files), have it create the IAM ...

mhamrah

9,188

asked Aug 20, 2019 at 20:54

15 votes

2 answers

22k views

AWS Glue output file name

I am using AWS to transform some JSON files. I have added the files to Glue from S3. The job I have set up reads the files in ok, the job runs successfully, there is a file added to the correct S3 ...

Ewan Peters

151

asked Feb 13, 2018 at 15:18

15 votes

6 answers

10k views

AWS Glue Crawler adding tables for every partition?

I have several thousand files in an S3 bucket in this form: ├── bucket │ ├── somedata │ │ ├── year=2016 │ │ ├── year=2017 │ │ │ ├── month=11 │ │ | │ ├── sometype-2017-11-01....

chazzmoney

221

asked Jan 22, 2018 at 0:10

15 votes

5 answers

12k views

How set name for crawled table?

The AWS crawler has a prefix property for adding new tables. So if I leave the prefix empty and point the crawler at s3://my-bucket/some-table-backup, it creates a table named some-table-backup. Is there a way to ...

Cherry

32.4k

asked Jan 18, 2018 at 13:18

15 votes

1 answer

3k views

Exception with Table identified via AWS Glue Crawler and stored in Data Catalog

I'm working to build the company's new data lake and am trying to find the best and most recent option to work with here. So, I found a pretty nice solution to work with EMR + S3 + Athena + Glue. ...

Thiago Baldim

7,582

asked Aug 18, 2017 at 4:55

15 votes

0 answers

10k views

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use AWS Glue Data Catalog as its metastore. As per guidelines provided in official AWS documentation (reference link below), I ...

Sridher

211

asked Jan 9, 2019 at 21:19

14 votes

4 answers

26k views

AWS Glue job consuming data from external REST API

I'm trying to create a workflow where an AWS Glue ETL job will pull the JSON data from an external REST API instead of S3 or any other AWS-internal sources. Is that even possible? Has anyone done it? Please ...

deorst

189

asked Jan 13, 2020 at 9:55

14 votes

4 answers

18k views

How to overcome Spark "No Space left on the device" error in AWS Glue Job

I used an AWS Glue job with PySpark to read data from S3 Parquet files totaling more than 10 TB, but the job was failing during the execution of the Spark SQL query with the error ...

Vigneshwaran

802

asked Dec 28, 2020 at 13:38

14 votes

2 answers

15k views

AWS Athena - GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters

I'm querying a table in Athena that is giving the error: GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters I was able to query it earlier, but added another ...

Neil Galloway

141

asked Jul 10, 2019 at 22:10

14 votes

1 answer

2k views

using AWS Glue with Apache Avro on schema changes

I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but am struggling through the following use case: We have an s3 bucket with a number of Avro files. We have decided to ...

CharStar

427

asked Feb 9, 2018 at 20:58

14 votes

1 answer

2k views

Glue Dynamic Frame is way slower than regular Spark

In the image below we have the same Glue job run with three different configurations in terms of how we write to S3: We used a dynamic frame to write to S3. We used a pure Spark frame to write to S3 ...

justHelloWorld

6,698

asked Dec 21, 2021 at 8:25

13 votes

3 answers

14k views

glue job for redshift connection: "Unable to find suitable security group"

I'm trying to set up an AWS Glue job and make a connection to Redshift. I'm getting an error when I set the connection type to Redshift: "Unable to find a suitable security group. Change connection ...

user3871

12.6k

asked Oct 2, 2017 at 18:30

13 votes

4 answers

23k views

Event based trigger of AWS Glue Crawler after a file is uploaded into a S3 Bucket?

Is it possible to trigger an AWS Glue crawler on new files that get uploaded into an S3 bucket, given that the crawler is "pointed" to that bucket? In other words: a file upload generates an event, ...

BoIde

316

asked Feb 16, 2018 at 13:47

13 votes

2 answers

22k views

How to solve this HIVE_PARTITION_SCHEMA_MISMATCH?

I have partitioned data in CSV files on S3: s3://bucket/dataset/p=1/*.csv (partition #1) ... s3://bucket/dataset/p=100/*.csv (partition #100) I run a classifier over s3://bucket/dataset/ and the ...

Raffael

19.8k

asked Sep 11, 2019 at 13:25
