AWS Glue: A Fully Managed ETL Service
What is ETL?
ETL, a common term in enterprise data management, stands for Extraction, Transformation, and Loading. You use the ETL process to extract data from different sources (such as RDBMS tables or cloud storage), apply business rules to transform the data into the expected format, and then load the data into a data warehouse (such as Redshift) for reporting and data analytics.
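The three stages can be sketched as three small functions. This is a minimal, illustrative example only; the CSV data, the "completed orders" business rule, and the in-memory "warehouse" are made up for demonstration.

```python
import csv
import io

def extract(csv_text):
    """Extract: read raw records from a CSV source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: apply a business rule -- keep completed orders, compute totals."""
    out = []
    for r in rows:
        if r["status"] == "complete":
            out.append({"order_id": r["order_id"],
                        "total": float(r["price"]) * int(r["qty"])})
    return out

def load(rows, warehouse):
    """Load: append the cleaned records to the target store (a plain list here)."""
    warehouse.extend(rows)

raw = "order_id,status,price,qty\n1,complete,9.99,2\n2,cancelled,5.00,1\n"
warehouse = []
load(transform(extract(raw)), warehouse)
```

In a real pipeline, extract would read from the source system, and load would write to the warehouse; the shape of the flow stays the same.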
Challenges with the Traditional ETL Process
Most of the jobs in the ETL process are custom coded, even with enterprise ETL tools. The custom-built code needs maintenance whenever the data structure changes (for example, when columns are added or removed). Per industry reports, roughly 70% of ETL jobs are built with custom code rather than ETL tools. Even after coding, jobs may have to be re-tuned as data volumes grow or requirements change.
AWS Glue is a new service designed to address these challenges of the conventional ETL process.
The AWS Glue Way of ETL
AWS Glue was designed to give end users the best experience and to ease maintenance. With just a few clicks in AWS Glue, developers can load data into the cloud, view it, transform it, and store it in a data warehouse with minimal coding. Glue consists of four components: the AWS Glue Data Catalog, crawlers, an ETL engine, and a scheduler. The Data Catalog manages the metadata, the ETL engine generates Python code to perform the ETL functions, and, as the name suggests, the scheduler lets you run jobs based on your needs.
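A crawler that feeds the Data Catalog on a schedule ties three of these components together. The sketch below builds the request body in the shape `glue.create_crawler` expects via boto3; the crawler name, IAM role, bucket path, and database name are hypothetical placeholders.

```python
def crawler_definition(name, role_arn, s3_path, database):
    """Build a request body in the shape of boto3's glue.create_crawler.

    The crawler (pointed at s3_path) infers the schema and stores it as
    tables in the named Data Catalog database; the Schedule field hands
    the recurring run to the scheduler.
    """
    return {
        "Name": name,
        "Role": role_arn,                     # IAM role the crawler assumes
        "DatabaseName": database,             # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 2 * * ? *)",      # re-crawl daily at 02:00 UTC
    }

definition = crawler_definition(
    "sales-crawler",                              # hypothetical names
    "arn:aws:iam::123456789012:role/GlueRole",
    "s3://my-bucket/sales/",
    "sales_db",
)
```

You would pass this dictionary to `boto3.client("glue").create_crawler(**definition)`; it is shown unexecuted here since it requires AWS credentials.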
Crawlers are the prime component of AWS Glue and are responsible for automatic schema inference.
To extract the schema from the data, you just have to point Glue to the source (for example, S3 if the data is stored in AWS S3); built-in classifiers in the crawlers detect the file type, extract the schema, and store the record structures and data types in the Glue Data Catalog. The crawlers support a wide variety of data formats, such as CSV, JSON, ORC, Avro, and Parquet. Crawlers can detect hive-style partitions if they exist in the source data and maintain those partitions in the Glue-created tables. You can also customize Glue crawlers to identify your own file types. You can schedule the crawlers to discover new data sources and schema changes and update the metadata accordingly; this differentiates Glue from other ETL tools, since schema detection is not common in traditional ETL tools. Another advantage of Glue is that the metadata loaded by crawlers can be accessed by Athena and Redshift Spectrum.
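Hive-style partitioning encodes column values directly in the object path as `key=value` segments. A small pure-Python function can show what the crawler recognizes; the log path below is a made-up example.

```python
import re

def hive_partitions(object_key):
    """Extract hive-style partition columns (key=value path segments)
    from an S3 object key, mimicking what a crawler detects."""
    return dict(re.findall(r"([^/=]+)=([^/]+)", object_key))

# A partitioned layout like this becomes partition columns year/month/day:
key = "logs/year=2018/month=07/day=14/part-0000.parquet"
parts = hive_partitions(key)
```

Because the crawler records these as partition columns, query engines such as Athena can prune partitions instead of scanning the whole dataset.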
Glue is serverless, so it comes with zero infrastructure expense. As with other AWS services, you pay only for the resources consumed. Behind the scenes, the Python code Glue generates is customizable, so you can edit it to add functionality not covered by Glue. If a crawler detects a change and updates the metadata, Glue updates the Python code automatically. The code runs on a scalable PySpark platform.
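The generated scripts are built around transforms that map and cast source fields to target fields. The function below is a pure-Python analogue of that idea, not the `awsglue` library itself; the field names and casts are illustrative.

```python
def apply_mapping(record, mappings):
    """Pure-Python analogue of a field-mapping transform: each mapping is
    (source_field, target_field, cast). Unknown source fields are dropped,
    which is also how stray columns get filtered out of the target."""
    out = {}
    for src, dst, cast in mappings:
        if src in record:
            out[dst] = cast(record[src])
    return out

# Hypothetical source record with string-typed fields and an unwanted column:
record = {"id": "7", "amt": "3.50", "debug_flag": "x"}
mapped = apply_mapping(record, [("id", "order_id", int),
                                ("amt", "amount", float)])
```

In a real Glue job the same intent is expressed declaratively in the generated script, and you edit that script when the defaults are not enough.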
Glue ETL jobs can be scheduled to run at a specific time of day, triggered by the completion of another job, or invoked from external sources such as AWS Lambda. AWS Glue provides the status of each job and pushes all notifications to Amazon CloudWatch Events, which can be used to monitor jobs. Once a job is completed, you can access the data from target systems such as Amazon Simple Storage Service (S3), Redshift, Relational Database Service, or any JDBC-compatible data store. You can then connect visualization tools like Amazon QuickSight to build visualizations, or perform further data analysis.
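Job-completion chaining is expressed as a conditional trigger. The sketch below builds the request body in the shape `glue.create_trigger` expects via boto3; the trigger and job names are hypothetical.

```python
def conditional_trigger(name, watched_job, next_job):
    """Build a request body in the shape of boto3's glue.create_trigger:
    start next_job whenever watched_job finishes successfully."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",                # fire on another job's state
        "Predicate": {"Conditions": [{
            "JobName": watched_job,
            "LogicalOperator": "EQUALS",
            "State": "SUCCEEDED",
        }]},
        "Actions": [{"JobName": next_job}],   # job to start when it fires
        "StartOnCreation": True,
    }

trigger = conditional_trigger("load-after-clean", "clean-sales", "load-sales")
```

A time-based trigger uses `"Type": "SCHEDULED"` with a cron expression instead of a `Predicate`. As above, you would pass the dictionary to `boto3.client("glue").create_trigger(**trigger)`.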
AWS Glue Usage
The Relus Big Data team has successfully solved many of our clients' ETL needs with AWS Glue. Below is an architecture in which the team used AWS Glue to ingest data for one of our clients at a rate of about 100,000 records per second. The ETL process designed with AWS Glue helped us increase performance by 150% and reduce expenses by about 50% compared to a conventional ETL process.
The Relus Cloud team can solve all your ETL needs with AWS Glue. For any assistance, contact our AWS Big Data Competency team.