Convert TSV files to JSON Lines with an AWS Lambda function

Kun-Hung Tsai
4 min read · Oct 5, 2019

Recently, I got a chance to play with AWS Lambda functions.

We have a batch job that dumps DB tables to .tsv files and uploads them to an S3 bucket every day.

My task is to transform the .tsv files into JSON Lines format, split them into gzipped files of 1M lines each, and upload them to another S3 bucket.

Architecture


I decided to use an S3 event notification to invoke a Lambda function, which then triggers a Batch job to do the transformation.

The Lambda function will receive the upload event from the S3 bucket and then trigger the Batch job.

The Batch job runs a Docker image from ECR containing a Python script that transforms the .tsv file into JSON Lines and uploads the result to another S3 bucket.

Steps

Create new Lambda function

First of all, you need to create a new Lambda function.
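If you prefer scripting to the console, the function can also be created with boto3. A minimal sketch, assuming a hypothetical function name, an already-created execution role ARN, and a zipped handler file:

import boto3

lambda_client = boto3.client('lambda', 'us-west-2')

# All names and ARNs below are placeholders; the role is the
# execution role described in the next section.
with open('lambda_function.zip', 'rb') as f:
    lambda_client.create_function(
        FunctionName='s3-tsv-to-json-trigger',
        Runtime='python3.7',
        Role='arn:aws:iam::123456789012:role/lambda-batch-role',
        Handler='lambda_function.lambda_handler',
        Code={'ZipFile': f.read()}
    )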

Once you create a new Lambda function, you need to assign an “Execution role” to it.

Choose the execution role you created for the Lambda function

Execution role

The execution role of a Lambda function is actually an IAM role.

The required IAM policies depend on your use case, but at a minimum you need AWSBatchFullAccess and AWSLambdaBasicExecutionRole.
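For reference, a rough sketch of creating such a role with boto3 (the role name is a placeholder, and the managed policy ARNs should be double-checked in the IAM console):

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets Lambda assume the role.
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'lambda.amazonaws.com'},
        'Action': 'sts:AssumeRole'
    }]
}

iam.create_role(
    RoleName='lambda-batch-role',  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the two managed policies mentioned above.
iam.attach_role_policy(
    RoleName='lambda-batch-role',
    PolicyArn='arn:aws:iam::aws:policy/AWSBatchFullAccess'
)
iam.attach_role_policy(
    RoleName='lambda-batch-role',
    PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
)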

Lambda function content

The Lambda function can be written in several programming languages; I use Python in this example.

The notification event will trigger the lambda_handler function and pass related information in the function arguments (e.g. bucket name, object key, …).

I extract the S3 bucket name and the object key from the S3 notification event and pass the key's path components as environment variables to the Batch job.

import boto3

def lambda_handler(event, context):
    for record in event['Records']:
        # Extract the bucket name and object key from the S3 notification event.
        bucket_name = record['s3']['bucket']['name']
        object_key = record['s3']['object']['key']

        # The object key looks like "<db_name>/<db_table>/<date>/...",
        # so split it into its path components.
        path_key = object_key.split('/')
        DB_NAME = path_key[0]
        DB_TABLE = path_key[1]
        DATE = path_key[2]

        # Submit the Batch job and pass the extracted values to the
        # container as environment variables.
        client = boto3.client('batch', 'us-west-2')
        client.submit_job(
            jobName='batch-tsv-to-json',
            jobQueue='batch-tsv-to-json',
            jobDefinition='batch-tsv-to-json:1',
            containerOverrides={
                'environment': [
                    {'name': 'DB_NAME', 'value': DB_NAME},
                    {'name': 'DB_TABLE', 'value': DB_TABLE},
                    {'name': 'DATE', 'value': DATE}
                ]
            })
    return 0

Enable S3 bucket notification

After that, I add a notification event in the S3 bucket properties.

S3 properties

Add a notification event that notifies the Lambda function. You can choose which events trigger it.

Every time a new object is put into the S3 bucket, it will trigger the Lambda function.

Don’t forget to check that the S3 trigger shows as enabled on the Lambda function.
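The same notification can also be wired up programmatically. A minimal sketch with boto3, assuming placeholder bucket and function names and that the source files end with .tsv.gz; note that S3 must separately be granted permission to invoke the function when you configure it this way instead of through the Lambda console:

import boto3

s3 = boto3.client('s3', 'us-west-2')

# Invoke the Lambda function whenever a matching object is put into the bucket.
s3.put_bucket_notification_configuration(
    Bucket='my-source-bucket',  # placeholder
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-west-2:123456789012:function:s3-tsv-to-json-trigger',
            'Events': ['s3:ObjectCreated:Put'],
            'Filter': {
                'Key': {
                    'FilterRules': [{'Name': 'suffix', 'Value': '.tsv.gz'}]
                }
            }
        }]
    }
)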

Push image to ECR

The Batch job needs a Docker image to run on the EC2 instances it launches. The image can come from different registries; here I use the AWS ECR service.

Create converter Python docker image

Create a Docker image that converts .tsv files to JSON Lines.

https://github.com/moonape1226/aws-s3-tsv-to-json/blob/master/app/converter.py
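The script linked above does the actual work; the following is only a minimal sketch of the idea, not the code from the repo: read a gzipped .tsv and write gzipped JSON Lines files of at most one million lines each. Downloading from and uploading to S3 is omitted, and the field names are placeholders.

import csv
import gzip
import json

LINES_PER_FILE = 1_000_000

def convert(tsv_path, out_prefix, fieldnames):
    # Stream the gzipped TSV and roll over to a new gzipped output
    # file every LINES_PER_FILE rows.
    part, count, out = 0, 0, None
    with gzip.open(tsv_path, 'rt', newline='') as src:
        reader = csv.DictReader(src, fieldnames=fieldnames, delimiter='\t')
        for row in reader:
            if out is None or count >= LINES_PER_FILE:
                if out:
                    out.close()
                out = gzip.open(f'{out_prefix}.{part:05d}.json.gz', 'wt')
                part, count = part + 1, 0
            out.write(json.dumps(row) + '\n')
            count += 1
    if out:
        out.close()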

Upload to ECR

Create a new ECR repository, authenticate your account, and push your image.

Authenticate your AWS account:

aws ecr get-login --region region --no-include-email
docker login -u AWS -p password https://aws_account_id.dkr.ecr.us-east-1.amazonaws.com

Tag and push image:

docker tag e9ae3c220b23 aws_account_id.dkr.ecr.region.amazonaws.com/my-web-app
docker push aws_account_id.dkr.ecr.region.amazonaws.com/my-web-app

Refer to the AWS ECR documentation for more information.

Create Batch job

To use a Batch job, you will need a Batch job definition, a job queue, a compute environment, and an ECS launch template.

In the Batch job definition settings, remember to set the container image to the ECR image path you just pushed.

Also set the Batch job role, an IAM role that grants the job container the permissions it needs (for example, access to the source and destination S3 buckets).
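Since I configured this in the console, the sketch below is only a rough boto3 equivalent of registering such a job definition; the image URI, role ARN, and resource sizes are placeholders, and the job queue and compute environment are assumed to exist already:

import boto3

batch = boto3.client('batch', 'us-west-2')

batch.register_job_definition(
    jobDefinitionName='batch-tsv-to-json',
    type='container',
    containerProperties={
        # Placeholder ECR image URI and job role ARN.
        'image': '123456789012.dkr.ecr.us-west-2.amazonaws.com/tsv-to-json:latest',
        'vcpus': 1,
        'memory': 2048,
        'jobRoleArn': 'arn:aws:iam::123456789012:role/batch-tsv-to-json-role'
        # DB_NAME / DB_TABLE / DATE are injected per job by the Lambda
        # function via containerOverrides, so no defaults are needed here.
    }
)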

Refer to the AWS Batch documentation for more information.

Check result

After all these steps, the Batch job should be executed automatically every time a new object is uploaded to the source S3 bucket.

The result should be uploaded to the destination S3 bucket.

You can also check the execution results of the Lambda function and the Batch job in CloudWatch.

Some reminders

  • The S3 bucket, Batch job, ECR repository, and Lambda function should be in the same AWS account and region
  • The default root disk size is 8 GB; if you need more, change the disk size in the launch template or AMI (see the sketch after this list)
  • All files are in gzip format in this example
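A sketch of creating a launch template with a larger root volume via boto3; the template name, volume size, and root device name are assumptions, so verify the device name for your AMI:

import boto3

ec2 = boto3.client('ec2', 'us-west-2')

ec2.create_launch_template(
    LaunchTemplateName='batch-tsv-to-json-lt',  # placeholder name
    LaunchTemplateData={
        'BlockDeviceMappings': [{
            'DeviceName': '/dev/xvda',  # assumed root device
            'Ebs': {'VolumeSize': 50, 'VolumeType': 'gp2'}  # 50 GB instead of the default 8 GB
        }]
    }
)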

Thanks for reading. Feel free to share your thoughts and comments.
