Quick rental market analysis with Python, Pandas


Real estate and rentals are always interesting, especially when one is in the market either as a tenant or a landlord. Though some data is available through APIs, it is generally days or weeks old, or already summarized to the extent that it is not much help. And if you would like the most granular data for your analysis, it gets even harder. The Craigslist website is one good alternative, offering the most recent (real-time) data. But it takes quite a bit of work to pull, extract and clean that data before using it.

AWS CloudTrail log analysis with Apache Drill

AWS CloudTrail tracks API calls made in one’s account; all these calls are logged and can be analyzed. The output files are typically JSON formatted. I wrote another article here which provides some background on why I needed to do some investigative work 🙂 on these files to identify an offending application. In short, AWS deprecated some of the API calls, and any application that made these calls had to be migrated.

These logs had multiple JSON objects on a single line! And the more API calls there are, the more data and the more output files. In one case I had more than 5,000 files generated over a couple of days. At the client site I couldn’t get much help with the code base, since it was old Java code written by an outsourced company that had moved on.

Then it became an exercise for me to use Apache Drill for the above scenario.  First I took a single file and ran a Drill query:

0: jdbc:drill:zk=local> select T.jRec.eventSource, T.jRec.eventName,
       T.jRec.awsRegion, T.jRec.sourceIPAddress,
       count(*)
from (select FLATTEN(Records) jRec
      from dfs.`/cloudtrail_logs/144702NNNNNN_CloudTrail_us-east-1_20160711T2345Z_CJPTqBCGPPc1Bhqc.json`) T
group by T.jRec.eventSource, T.jRec.eventName,
         T.jRec.awsRegion, T.jRec.sourceIPAddress
order by EXPR$1;
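
For readers who want to sanity-check what the query computes, here is a minimal Python sketch of the same flatten-and-count logic using only the standard library. It assumes each file holds one JSON object with a top-level Records array, as the FLATTEN(Records) call suggests. The function name count_events and the sample records are made up for illustration; they are not taken from the real logs.

```python
import json
from collections import Counter


def count_events(cloudtrail_doc):
    """Count records per (eventSource, eventName, awsRegion, sourceIPAddress).

    Hypothetical helper mirroring the GROUP BY in the Drill query.
    """
    counts = Counter()
    for rec in cloudtrail_doc.get("Records", []):
        key = (
            rec.get("eventSource"),
            rec.get("eventName"),
            rec.get("awsRegion"),
            rec.get("sourceIPAddress"),
        )
        counts[key] += 1
    return counts


# Made-up sample data, serialized to a single line like the log files described above.
sample_line = json.dumps({
    "Records": [
        {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances",
         "awsRegion": "us-east-1", "sourceIPAddress": "192.0.2.10"},
        {"eventSource": "s3.amazonaws.com", "eventName": "ListBuckets",
         "awsRegion": "us-east-1", "sourceIPAddress": "192.0.2.11"},
        {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances",
         "awsRegion": "us-east-1", "sourceIPAddress": "192.0.2.10"},
    ]
})

counts = count_events(json.loads(sample_line))

# Order by event name, like the ORDER BY on the second selected column.
for key, n in sorted(counts.items(), key=lambda kv: kv[0][1] or ""):
    print(key, n)
```

This is only practical for a handful of files; with thousands of them, a query engine such as Drill is the more convenient tool.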
