Using AWS Sagemaker and CNN for Dog Breed Identification

Motivation for this post came from a recent image-identification project where, given a dog image, the application would identify its breed.  It turned out to be more interesting and enjoyable than I expected when I started, even though I ended up spending substantial time learning new concepts, refreshing linear algebra and calculus, and reading related articles and research papers.

In this project I used AWS SageMaker, a Machine Learning (ML) platform, to build, train, and deploy the models.  It provides all the needed components, including prebuilt images suitable for ML projects, a Jupyter Notebook environment, infrastructure to deploy with a single click from the notebook, etc. The project uses a ResNet CNN that can either be trained from scratch or trained with transfer learning when a large number of training images is not available.

If you want to jump straight to the notebook code, it is here.

Neural Network (NN)

Neural networks draw inspiration from their biological counterparts. Many NN machine learning algorithms are based on them, including the perceptron, Hopfield networks, CNNs, RNNs, LSTMs, etc.  This article briefly covers the perceptron and CNNs.
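As a minimal illustration before the details (a from-scratch sketch, not the SageMaker code used in the project), a perceptron computes a weighted sum of its inputs plus a bias and applies a step activation:

```python
# Minimal perceptron sketch: weighted sum of inputs plus bias,
# passed through a step activation function.

def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum of inputs plus bias is positive, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0

# With hand-picked weights a single perceptron can act as an AND gate:
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], AND_WEIGHTS, AND_BIAS))
```

In practice the weights are learned from data rather than hand-picked; CNNs stack many such learned units with convolutional weight sharing.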

Read More »

Macbook Pro Wifi Issue When Using 2.4GHz Band

Recently I got a MacBook Pro (model A1990) along with a few other items, including a USB 10-in-1 hub (EAUSOO model ES-HB003C).  All was well except for one particularly vexing issue: once in a while the computer would stop communicating with the wifi router. Browser pages would take very long or time out, and command-line utilities that used the network would fail.  Many a time the network speed would drop by 90% or more! Initially, without a clue about what was happening, I tried different things, including (a) suspecting some application was hogging the bandwidth, (b) suspecting other devices were doing something similar, (c) changing the relative position of my Mac and the wifi extender (TP-Link) I was using on the 2.4GHz band, and (d) a few other things.

It turned out to be an issue with the USB hub.  Though some have experienced a similar problem with the latest (2017 or later) Macs, it is somewhat rare.  There is a forum thread where many have expressed frustration with this problem and shared possible solutions.  For more details, see that thread.

Since the latest Macs are equipped only with USB-C ports, one has to use hubs to connect many external devices: monitors, flash drives, headphones, etc.  The setup would look very similar to the one shown below in the Intel paper.  Not being able to connect these devices while using wifi became a blocker for me in using this new computer.  Interestingly, in my case the issue was worse when I connected the hub to the right-side ports, while the left-side ports also did poorly.


Read More »

Macbook key codes and validating the new keyboard keys

Recently I bought a new keyboard (see below) on Amazon for my MacBook Pro. I needed another, cheaper keyboard to use with a wall-mounted monitor near a treadmill. It is a knock-off of Apple's Magic Keyboard.  Interestingly, it has both Bluetooth and 2.4GHz wireless connectivity while having all four modifier keys (Fn, Ctrl, Option, and Cmd) in the same order.  Many third-party Mac keyboards either lack the Fn key or place it near the number keys, and many others arrange the keys in Windows-compatible order, which is not ideal.


This keyboard worked well but for one major issue: the Esc key didn't act the way one expected.  Nothing happened on pressing it, and in the Vim editor it is one of the most often used keys.  I needed to either fix it, identify the key code it was actually generating and remap it, or return the item.  I knew what ASCII code (decimal 27) and key code (53) to expect on pressing the key.  Here are the corresponding codes.
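Those expected values can be sanity-checked in a few lines of Python (a sketch I am adding here for illustration; reading live keypresses would additionally need the terminal put into raw mode, e.g. via the standard termios module):

```python
# Esc is ASCII 27 (hex 0x1b); macOS reports virtual key code 53 for it,
# which is a separate numbering from the ASCII value.
ESC_ASCII = 27
ESC_MAC_KEYCODE = 53

def is_escape_byte(raw: bytes) -> bool:
    """Return True if a single raw byte read from the terminal is Esc."""
    return len(raw) == 1 and raw[0] == ESC_ASCII

print(is_escape_byte(b"\x1b"))  # the byte a working Esc key should produce
print(is_escape_byte(b"a"))     # any other key
```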

Read More »

Meltdown and Spectre patch effects on Mac performance

With major flaws recently identified in modern CPUs, all major software and (CPU) hardware companies rushed to provide solutions for their systems as quickly as possible.  For more detailed information check out this site, and for a programmer's view, or to test on a Linux system, try this.

Since the computer OS (Operating System) kernel space is highly protected (as is any other process space) and isolated from interference by other processes, any breakdown here will lead to major issues.  Quote from the (Meltdown) paper: "The attack is independent of the operating system, and it does not rely on any software vulnerabilities. Meltdown breaks all security assumptions given by address space isolation as well as paravirtualized environments and, thus, every security mechanism building upon this foundation."

The paper also details the scope of the issue, which affects all modern computers and phones!! Quote: "On affected systems, Meltdown enables an adversary to read memory of other processes or virtual machines in the cloud without any permissions or privileges, affecting millions of customers and virtually every user of a personal computer"

(PS: Bold highlighting added by me)

Reading through the paper and looking at the example code snippet below took me back to the days when I did some assembly-level programming on the Intel 8086 series. It was fun, challenging, and interesting.

; rcx = kernel address
; rbx = probe array
retry:
mov al, byte [rcx]
shl rax, 0xc
jz retry
mov rbx, qword [rbx + rax]
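The `shl rax, 0xc` line is the key trick: shifting the transiently read secret byte left by 12 bits multiplies it by 4096, so each possible byte value touches a different page of the probe array. In Python terms (my own illustrative sketch, not code from the paper):

```python
PAGE_SIZE = 4096  # 0x1000; 'shl rax, 0xc' multiplies by 2**12

def probe_offset(secret_byte: int) -> int:
    """Map a secret byte to its probe-array offset,
    mirroring 'shl rax, 0xc' in the snippet above."""
    return secret_byte << 12  # same as secret_byte * PAGE_SIZE

# Each of the 256 possible byte values lands on a distinct page, so a
# later cache-timing pass over the probe array reveals which value was
# touched during the transient execution.
print(probe_offset(0))
print(probe_offset(1))
print(probe_offset(255))
```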

Read More »

Precompiled Redshift Python DB adapter

Recently I built an application that uses AWS Lambda to load data from a datalake into Redshift at regular intervals.  The steps to compile the adapter for the AWS Lambda environment are given here.  I also uploaded it to GitHub here, so one can use it without having to go through the compilation steps.




Compiling Python Redshift DB adapter for AWS Lambda env.

AWS Lambda has gained huge momentum in the last couple of years and enabled software architects and developers to build FaaS (Function as a Service) applications.  As much as Lambda helps in scaling applications, it has some limitations, like caps on execution duration, available memory, etc.  For long-running jobs, typically backend or batch processing, the five-minute duration limit can be a deal breaker.  But with appropriate data partitioning and architecture it is still an excellent option for enterprises to scale their applications cost-effectively.

In a recent project, I architected data to be loaded from a datalake into Redshift.  The data is produced by an engine in batches and pushed to S3.  It is partitioned on a time scale, and a consumer Python application loads it at regular intervals into a Redshift staging environment.  For a scalable solution, the datalake can be populated by multiple producers, and similarly one or more consumers can drain the datalake queue to load into Redshift.  The data from multiple staging tables is then loaded into the final table after deduping and data augmentation.
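The heart of such a consumer is a Redshift COPY statement per time partition. A sketch of building one (all schema, table, bucket, and role names below are hypothetical; in the real application the statement is executed through a DB connection such as a psycopg2 cursor rather than printed):

```python
def build_copy_sql(schema: str, staging_table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that loads one time partition
    from the datalake (S3) into a staging table."""
    return (
        f"COPY {schema}.{staging_table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS JSON 'auto';"
    )

# Hypothetical names for illustration only.
sql = build_copy_sql(
    schema="staging",
    staging_table="clicks_2016_07_11",
    s3_prefix="s3://my-datalake/clicks/dt=2016-07-11/",
    iam_role="arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```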

Read More »

Quick rental market analysis with Python and pandas

[Images: houses drawing; 2016 rental analysis boxplot]

Real estate and rentals are always interesting, especially when one is in the market either as a tenant or a landlord.  Though some data is available through APIs, it is generally days or weeks old, or already summarized to the point that it is not very helpful. And if you'd like the most granular data to perform some analysis, it gets even harder. The Craigslist website is one good alternative where you can get the most recent (real-time) data.  But it needs quite a bit of work to pull, extract, and clean before using it.

Read More »
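As a taste of that cleaning work, listing titles carry the price embedded in free text; a small helper (the title format shown is an invented example) can extract it before the rows go into a pandas DataFrame:

```python
import re

# Matches a $-prefixed amount such as "$2,350" anywhere in a title.
PRICE_RE = re.compile(r"\$(\d[\d,]*)")

def extract_price(title: str):
    """Pull the first $-prefixed amount out of a raw listing title.
    Returns an int, or None when no price is present."""
    match = PRICE_RE.search(title)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))

print(extract_price("$2,350 / 2br - 900ft2 - Sunny apartment"))
print(extract_price("Room available, contact for details"))
```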

AWS CloudTrail log analysis with Apache Drill

AWS CloudTrail tracks the API calls made in one's account, and all these calls are logged and can be analyzed. The output files are typically JSON-formatted.  I wrote another article here which provides some background on why I needed to perform investigative work 🙂 on these files to identify an offending application.  In short, AWS deprecated some API calls, and any application that made those calls had to be migrated.

These logs had multiple JSON objects on a single line!  And the more API calls, the more data and the more output files.  In one case I had more than 5,000 files generated over a couple of days.  At the client site I couldn't get much help on the code base, since it was old Java code written by an outsourced company that had moved on.

It then became an exercise for me to apply Apache Drill to the above scenario.  First I took a single file and ran a Drill query:

0: jdbc:drill:zk=local> select T.jRec.eventSource, T.jRec.eventName,
   T.jRec.awsRegion, T.jRec.sourceIPAddress
from (select FLATTEN(Records) jRec
     from    dfs.`/cloudtrail_logs/144702NNNNNN_CloudTrail_us-east-1_20160711T2345Z_CJPTqBCGPPc1Bhqc.json`) T
group by T.jRec.eventSource, T.jRec.eventName,
T.jRec.awsRegion, T.jRec.sourceIPAddress
order by EXPR$1;
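The same flatten-and-count can be sketched in plain Python with the standard library, which is handy for spot checks on a single file (the sample record below is abbreviated and invented, using the CloudTrail field names):

```python
import json
from collections import Counter

# Abbreviated, invented sample in the CloudTrail log shape:
# one top-level "Records" array of API-call events.
sample = '''{"Records": [
  {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances",
   "awsRegion": "us-east-1", "sourceIPAddress": "10.0.0.1"},
  {"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
   "awsRegion": "us-east-1", "sourceIPAddress": "10.0.0.2"},
  {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances",
   "awsRegion": "us-east-1", "sourceIPAddress": "10.0.0.1"}
]}'''

def count_calls(log_text: str) -> Counter:
    """Count (eventSource, eventName) pairs across a log file's Records,
    the same grouping the Drill query performs."""
    records = json.loads(log_text)["Records"]
    return Counter((r["eventSource"], r["eventName"]) for r in records)

for (source, name), n in count_calls(sample).most_common():
    print(source, name, n)
```

Drill still wins once thousands of files are involved, since it parallelizes the scan and takes SQL directly.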

Read More »

Invalid data dump – Amazon Redshift, Data Pipeline and S3

Amazon Data Pipeline (DPL) is a late entrant to the ETL market, but it provides many features that are well integrated with the AWS cloud.  In any data extraction process one encounters invalid or incorrect data, and that data may either be logged or ignored, depending on the business requirements or the severity of the rejected data.

When your data flows through S3 to other AWS platforms, be it Redshift, RDS, DynamoDB, etc., you can use S3 to dump rejected data.  For example, in a pipeline similar to the DPL below, one of the steps could filter the rejects and dump them to S3 for later analysis.

By standardizing the rejections from different DPLs, another DPL can regularly load them back into Redshift for quick realtime analysis or a downstream deep dive.  This will also greatly help in recovery and reruns when needed.

Following are the simple, high-level steps where rejected data is directed to S3.  The parameters are provided through the environment setup, for example: #{myDPL_schema_name} = ‘prod_public’ and #{myDPL_error_log_path} = ‘s3://emr_cluster/ad/clicks/…’


-- PreProcess
-- Load the staging table and, at the same time, update its data_error column when possible.
INSERT INTO #{myDPL_schema_name}.#{myDPL_staging_table}
SELECT col1,
       ...,
       CASE WHEN ... THEN '...' END AS data_error
FROM #{myDPL_schema_name}.#{myDPL_source_table}
LEFT OUTER JOIN #{myDPL_schema_name}.table_1
ON ...
LEFT OUTER JOIN #{myDPL_schema_name}.dim_1
ON ...
LEFT OUTER JOIN #{myDPL_schema_name}.dim_N
ON ...

-- OR, If data_error column is updated separately...
UPDATE #{myDPL_schema_name}.#{myDPL_staging_table}
SET data_error = ...
FROM #{myDPL_schema_name}.#{myDPL_staging_table}
JOIN #{myDPL_schema_name}.dim_1
ON ...
JOIN #{myDPL_schema_name}.dim_N
ON ...

-- Temporary table
CREATE TEMP TABLE this_subject_dpl_rejections AS (
SELECT *
FROM #{myDPL_schema_name}.#{myDPL_staging_table}
WHERE data_error IS NOT NULL
);

-- Dump to S3
UNLOAD ('SELECT * FROM this_subject_dpl_rejections')
TO '#{myDPL_ErrorLogPath}/yyyy=#{format(@scheduledStartTime,'YYYY')}/'
CREDENTIALS 'aws_access_key_id=#{myDPL_AWSAccessKey};aws_secret_access_key=#{myDPL_AWSSecretKey}';

Now load the errors back to Redshift…

COPY #{myDPL_schema_name}.#{myDPL_error_table}
FROM '#{myDPL_ErrorLogPath}/yyyy=#{format(@scheduledStartTime,'YYYY')}/'
CREDENTIALS 'aws_access_key_id=#{myDPL_AWSAccessKey};aws_secret_access_key=#{myDPL_AWSSecretKey}';
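The #{...} placeholders used throughout these steps are expanded by Data Pipeline at run time; the substitution itself can be sketched in plain Python (the staging table name below is hypothetical, and the richer #{format(@scheduledStartTime, ...)} expressions are left out of this sketch):

```python
import re

def expand_params(template: str, params: dict) -> str:
    """Replace simple #{name} placeholders with their configured values."""
    return re.sub(r"#\{(\w+)\}", lambda m: params[m.group(1)], template)

params = {
    "myDPL_schema_name": "prod_public",
    "myDPL_staging_table": "stg_clicks",  # hypothetical table name
}

sql = expand_params(
    "INSERT INTO #{myDPL_schema_name}.#{myDPL_staging_table} ...",
    params,
)
print(sql)
```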