AWS CloudTrail tracks the API calls made in one’s account; all of these calls are logged and can be analyzed. The output files are typically JSON-formatted. I wrote another article here that gives some background on why I needed to do some investigative work 🙂 on these files to identify an offending application. In short, AWS deprecated some of its API calls, and any application that made these calls had to be migrated.
These logs had multiple JSON objects on a single line! And the more API calls there are, the more data and the more output files. In one case I had more than 5,000 files generated over a couple of days. At the client site I couldn’t get much help with the code base, since it was old Java code written by an outsourced company that had since moved on.
So this became an exercise in using Apache Drill for the above scenario. First I took a single file and ran a Drill query:
0: jdbc:drill:zk=local> select T.jRec.eventSource, T.jRec.eventName, T.jRec.awsRegion, T.jRec.sourceIPAddress, count(*) from (select FLATTEN(Records) jRec from dfs.`/cloudtrail_logs/144702NNNNNN_CloudTrail_us-east-1_20160711T2345Z_CJPTqBCGPPc1Bhqc.json`) T group by T.jRec.eventSource, T.jRec.eventName, T.jRec.awsRegion, T.jRec.sourceIPAddress order by EXPR$1;
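To make the logic of that query concrete, here is a rough Python sketch of the same idea: expand the top-level `Records` array (what `FLATTEN(Records)` does) and count records per combination of the four fields. It is only an illustration, not what I ran at the client site. The function name `count_events` and the sample records (with documentation-range IP addresses) are made up for this example.

```python
import json
from collections import Counter


def count_events(cloudtrail_doc):
    """Count records per (eventSource, eventName, awsRegion, sourceIPAddress)."""
    counts = Counter()
    # "Records" is the top-level array; iterating over it plays the role of
    # FLATTEN(Records), which turns each array element into its own row.
    for rec in cloudtrail_doc.get("Records", []):
        key = (
            rec.get("eventSource"),
            rec.get("eventName"),
            rec.get("awsRegion"),
            rec.get("sourceIPAddress"),
        )
        counts[key] += 1  # equivalent of count(*) with GROUP BY on the key
    return counts


# Hypothetical sample: one JSON document on a single line, as in the log files.
sample_line = (
    '{"Records": ['
    '{"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances",'
    ' "awsRegion": "us-east-1", "sourceIPAddress": "192.0.2.10"},'
    '{"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances",'
    ' "awsRegion": "us-east-1", "sourceIPAddress": "192.0.2.10"},'
    '{"eventSource": "s3.amazonaws.com", "eventName": "ListBuckets",'
    ' "awsRegion": "us-east-1", "sourceIPAddress": "198.51.100.7"}'
    ']}'
)

sample = json.loads(sample_line)
counts = count_events(sample)

# Print one row per group, sorted by the grouping key.
for key, n in sorted(counts.items()):
    print(*key, n)
```

For a real file you would replace `sample_line` with the file's contents, e.g. `json.load(open(path))`. This works for one file at a time, which is exactly why Drill is attractive once there are thousands of them.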