AWS CloudTrail log analysis with Apache Drill

AWS CloudTrail tracks the API calls made in your account and logs them for later analysis. The output files are typically JSON formatted.  I wrote another article here which provides some background on why I needed to do some investigative work 🙂 on these files to identify an offending application.  In short, AWS deprecated some API calls, and any application still making those calls had to be migrated.

These logs had multiple JSON objects on a single line!  And the more API calls, the more data and the more output files.  In one case I had more than 5,000 files generated over a couple of days.  At the client site I couldn't get much help on the code base, since it was old Java code written by an outsourced company that had since moved on.
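To see what Drill has to work with: each CloudTrail log file is one JSON object whose `Records` array holds the individual API-call events, often all on a single physical line. A minimal Python sketch (the function name and path are my own, not from any library) that flattens one file and tallies the calls:

```python
import json
from collections import Counter

def count_events(path):
    """Flatten a CloudTrail log file's Records array and tally
    (eventSource, eventName) pairs. Path is hypothetical."""
    with open(path) as f:
        doc = json.load(f)  # the whole file is a single JSON object
    return Counter(
        (rec.get("eventSource"), rec.get("eventName"))
        for rec in doc.get("Records", [])
    )
```

This is essentially what the Drill queries below do, except Drill does it declaratively and across thousands of files at once.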

This became an exercise in using Apache Drill for the above scenario.  First, I took a single file and ran a Drill query:

0: jdbc:drill:zk=local> select T.jRec.eventSource, T.jRec.eventName,
   T.jRec.awsRegion, T.jRec.sourceIPAddress,
   count(*)
from (select FLATTEN(Records) jRec
     from    dfs.`/cloudtrail_logs/144702NNNNNN_CloudTrail_us-east-1_20160711T2345Z_CJPTqBCGPPc1Bhqc.json`) T
group by T.jRec.eventSource, T.jRec.eventName,
T.jRec.awsRegion, T.jRec.sourceIPAddress
order by EXPR$1;

Note the “DescribeJobFlows” call, the API call of interest to me; in the image below it is 4th from the top, under the “EXPR$1” column.

[Image: cloudtrail_1_file_drilled — query output for a single file]

It is so cool to run the same kind of query on multiple files with simple wildcards!  The query parsed more than 4,000 files in a little over 30 seconds on a single, heavily loaded node. The query spit out the following: 1,958 deprecated calls made in a couple of days.

[Image: cloudtrail_n_file_drilled — query output across all files]
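As a sketch of what the wildcard form looks like, assuming all the log files sit in the same directory (the path and the WHERE filter on the deprecated call are illustrative, not copied from my session):

```sql
-- Same shape as the single-file query, but a glob matches every file;
-- here the filter isolates the deprecated DescribeJobFlows calls.
select T.jRec.eventSource, T.jRec.eventName,
       T.jRec.awsRegion, T.jRec.sourceIPAddress,
       count(*)
from (select FLATTEN(Records) jRec
      from dfs.`/cloudtrail_logs/*.json`) T
where T.jRec.eventName = 'DescribeJobFlows'
group by T.jRec.eventSource, T.jRec.eventName,
         T.jRec.awsRegion, T.jRec.sourceIPAddress;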

Once I knew the region, host IP, and application, it was easy to nail down the shell script that was kicking off EMR instances using the old jar.

In this case the analysis was much easier than it would have been even with Spark. For example, Spark's JSON reader expects a single JSON record per line, so the files would have needed some preprocessing before Spark could be fed this data.
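That preprocessing step could be sketched as follows: explode each CloudTrail file's `Records` array into one record per line (JSON Lines), the shape Spark's JSON reader handles natively. The function name and paths are hypothetical:

```python
import json

def to_json_lines(src_path, dst_path):
    """Rewrite a CloudTrail log file as JSON Lines: one record from the
    Records array per output line. Paths are illustrative."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for rec in json.load(src).get("Records", []):
            dst.write(json.dumps(rec) + "\n")
```

With Drill, none of this is needed; the FLATTEN in the query does the equivalent on the fly.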

Note: the structure of a single CloudTrail JSON record (one element of the Records array) is shown below:

{
	"eventVersion": "1.03",
	"userIdentity": {
		"type": "IAMUser",
		"principalId": "A----------------G",
		"arn": "arn:aws:iam::1447NNNNNNNN:user/udxx_prox",
		"accountId": "1447NNNNNNNNN",
		"accessKeyId": "A-----------------A",
		"sessionContext": {
			"attributes": {},
			"sessionIssuer": {}
		},
		"userName": "udxx_prox"
	},
	"eventTime": "2016-07-11T23:49:40Z",
	"eventSource": "s3.amazonaws.com",
	"eventName": "GetBucketAcl",
	"awsRegion": "us-east-1",
	"sourceIPAddress": "AWS Internal",
	"userAgent": "[aws-internal/3]",
	"requestParameters": {
		"instanceGroupTypes": [],
		"instanceIdentity": {},
		"bucketName": "udms-prod",
		"objectIds": []
	},
	"requestID": "74AAE30BXXXXXXXX",
	"eventID": "ec769df9-833f-4f1d-90cd-830ff9b9ff43",
	"eventType": "AwsApiCall",
	"recipientAccountId": "144NNNNNNNN",
	"responseElements": {
		"clusters": []
	},
	"additionalEventData": {
		"vpcEndpointId": "vpce-2a2ed643"
	}
}
