Tutorial: Get started with Graph and AWS
Objectives
By the end of this tutorial, you will be able to:
- Configure Aerolab to manage Aerospike clusters on AWS infrastructure.
- Create an S3 bucket and upload graph data, bulk loader JAR, and configuration files.
- Deploy an Aerospike Database cluster on AWS EC2 using Aerolab.
- Create an EMR cluster with Apache Spark to run distributed bulk loading jobs.
- Load vertex and edge data into Aerospike Graph using distributed processing.
This tutorial shows you how to use AWS infrastructure to load large-scale graph data into Aerospike Graph Service (AGS). You will configure Aerolab for AWS, create an Aerospike cluster, and use Amazon EMR (Elastic MapReduce) with Apache Spark to run a distributed bulk load.
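Before provisioning any of these services, it helps to confirm that your AWS credentials resolve. Here is a minimal check with boto3, assuming credentials are already configured through the usual environment variables or `~/.aws` files:

```python
import boto3

# Verify that boto3 can find usable AWS credentials.
identity = boto3.client("sts").get_caller_identity()
print(f"Authenticated as {identity['Arn']} in account {identity['Account']}")
```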
AWS infrastructure and data loading
Each AWS component plays a specific role in the pipeline:
- Aerolab provisions and manages the Aerospike cluster on EC2, keeping configuration consistent across nodes (see the provisioning sketch after this list).
- Amazon S3 acts as durable staging for CSV data, loader artifacts, and logs that Spark executors read and write (see the staging sketch after this list).
- Aerospike Database provides the persistent graph store; the bulk loader writes vertex and edge records into it through the Aerospike client.
- EMR with Apache Spark supplies the distributed compute layer that validates, transforms, and streams S3 data into Aerospike.
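If you prefer to drive the whole pipeline from one script, the two sketches below wrap the first two components in Python. First, Aerolab provisioning through its CLI; the region, cluster name, and node count are hypothetical placeholders, and flag spellings should be checked against `aerolab cluster create --help` for your release:

```python
import subprocess

def run(cmd):
    """Run a CLI command, echoing it and failing loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Point Aerolab at the AWS backend (the region is a placeholder).
run(["aerolab", "config", "backend", "-t", "aws", "-r", "us-east-1"])

# Create a three-node Aerospike cluster; the name "graph-demo" is illustrative.
run(["aerolab", "cluster", "create", "-n", "graph-demo", "-c", "3"])
```

Second, the S3 staging step with boto3. The bucket name, key layout, and local file names are hypothetical; substitute the artifact names from your Aerospike Graph distribution:

```python
import boto3

BUCKET = "my-graph-bulk-load"  # hypothetical bucket name

s3 = boto3.client("s3", region_name="us-east-1")

# Create the staging bucket. In us-east-1, no LocationConstraint is
# allowed; in any other region, pass
# CreateBucketConfiguration={"LocationConstraint": region}.
s3.create_bucket(Bucket=BUCKET)

# Upload graph data, the bulk loader JAR, and the loader configuration.
for local_path, key in [
    ("data/vertices.csv", "input/vertices/vertices.csv"),
    ("data/edges.csv", "input/edges/edges.csv"),
    ("aerospike-graph-bulk-loader.jar", "jars/aerospike-graph-bulk-loader.jar"),
    ("bulk-loader.properties", "config/bulk-loader.properties"),
]:
    s3.upload_file(local_path, BUCKET, key)
```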
Distributed bulk loading workflow
With the AWS building blocks in place, the load job connects them in a predictable sequence:
- An EMR Spark driver pulls the bulk loader JAR and configuration file from S3, then distributes tasks to executors (see the submission sketch after this list).
- Executors stream vertex and edge CSV files from S3, apply validation/transformation logic, and stage intermediate results in executor memory or temporary S3 paths as needed.
- Each executor writes validated graph records directly to the Aerospike cluster provisioned by Aerolab, respecting the namespace and set mappings defined in your loader configuration (sketched at the end of this section).
- Spark pushes progress logs and final status artifacts back into S3 so you can monitor runs with the AWS CLI or console.
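Here is a sketch of provisioning EMR and submitting the load as a spark-submit step with boto3. Instance types, the release label, and the IAM role names are illustrative defaults; the loader's main class and its `-c` configuration flag are assumptions to verify against the bulk loader reference for your AGS version:

```python
import boto3

REGION = "us-east-1"           # hypothetical region
BUCKET = "my-graph-bulk-load"  # hypothetical bucket from the staging step

emr = boto3.client("emr", region_name=REGION)

# Create a small Spark cluster; size the instance groups to your data.
cluster = emr.run_job_flow(
    Name="graph-bulk-load",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
    LogUri=f"s3://{BUCKET}/logs/",
)
cluster_id = cluster["JobFlowId"]

# Submit the bulk load as a spark-submit step. The main class shown is
# an assumption; check the entry point shipped with your AGS version.
step = emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[{
        "Name": "aerospike-graph-bulk-load",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", "com.aerospike.firefly.bulkloader.SparkBulkLoader",
                f"s3://{BUCKET}/jars/aerospike-graph-bulk-loader.jar",
                "-c", f"s3://{BUCKET}/config/bulk-loader.properties",
            ],
        },
    }],
)

# Block until the step finishes, then inspect the logs under s3://BUCKET/logs/.
emr.get_waiter("step_complete").wait(
    ClusterId=cluster_id, StepId=step["StepIds"][0]
)
```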
Internally, the loader still follows its core phases: preflight checks, temporary data writes, supernode extraction, edge cache generation, vertex writes, and edge writes. Framing those steps alongside S3, EMR, and Aerospike highlights which AWS components support each phase of the workflow.
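For reference, the configuration file the driver pulls from S3 is a plain Java properties file. The sketch below generates one; the key names follow the documented `aerospike.graphloader.*` option family, but both the keys and the values here (seed host, namespace, paths) are illustrative and should be checked against the bulk loader reference for your AGS version:

```python
# Generate bulk-loader.properties for upload to S3.
BUCKET = "my-graph-bulk-load"  # hypothetical bucket
SEED_HOST = "10.0.0.12"        # hypothetical private IP of an Aerospike node

properties = {
    "aerospike.client.host": SEED_HOST,
    "aerospike.client.port": "3000",
    "aerospike.client.namespace": "test",  # namespace the graph is written into
    "aerospike.graphloader.data.vertices": f"s3://{BUCKET}/input/vertices/",
    "aerospike.graphloader.data.edges": f"s3://{BUCKET}/input/edges/",
}

with open("bulk-loader.properties", "w") as f:
    for key, value in properties.items():
        f.write(f"{key}={value}\n")
```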