NASA Orion Journey To Mars Data Analysis

I have always been very interested in Physics and enjoy reading related books and articles, and watching shows like Carl Sagan’s Cosmos, Freeman’s Through the Wormhole, etc.  For that matter, this site’s name is derived from H-I Region (interstellar cloud).

When I saw NASA’s “Send Your Name on NASA’s Journey to Mars, Starting with Orion’s First Flight”, I was excited to submit my family, relatives’ and friends’ names, along with a few charities’ names.  The names will be placed on a microchip aboard Orion’s test flight on Dec. 4, 2014, which will orbit the Earth, and on future journeys to Mars!  The following quote is from the NASA site:

Your name will begin its journey on a dime-sized microchip when the agency’s Orion spacecraft launches Dec. 4 on its first flight, designated Exploration Flight Test-1. After a 4.5 hour, two-orbit mission around Earth to test Orion’s systems, the spacecraft will travel back through the atmosphere at speeds approaching 20,000 mph and temperatures near 4,000 degrees Fahrenheit, before splashing down in the Pacific Ocean.

But the journey for your name doesn’t end there. After returning to Earth, the names will fly on future NASA exploration flights and missions to Mars.

More info at

The Orion test flight uses the big-boy Delta IV Heavy (the biggest expendable launch system), and after orbiting the Earth twice, Orion will reenter and splash down in the Pacific Ocean.

Courtesy NASA/

Some sample boarding passes:

By the time entries closed (on Oct. 31, I believe), there were nearly 1.4 million names (1,379,961 exactly), and the top countries by count were the United States, India and the United Kingdom, with people from almost all countries having submitted names.  For more details see .  The bar chart below shows the same info.

Though the US, India and the UK were the top three by number of names submitted, I was curious how countries fared when adjusted for population size, GDP and area (sq. miles).  With that in mind, I pulled NASA data and country data from the following web sites.

I built a quick Python script to pull the data, join it with the country data and perform some minor calculations.  The code is located at Gist, or see the end of this post.

Running through a few R scripts, I clustered countries based on each country’s

  • Orion passenger count/ 10K people
  • Orion passenger count/ 1K sq. miles
  • Orion passenger count/ Billion $ GDP

and then normalized the metrics through R’s scale() for cluster selection.  The optimal cluster count seems to be 7 or 8.  Monaco and Singapore are major outliers due to the skew caused by their small geographical area (sq. miles).  See below: Monaco is the single dangler at the top right, and Singapore/Hungary sit at the bottom right, above the rest of the countries.
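The normalization step can be sketched in Python, even though the actual analysis used R’s scale().  The function below applies the same default transform (center to mean 0, scale to standard deviation 1); the per-country values are made up for illustration, not the real NASA counts.

```python
from statistics import mean, stdev

def scale(values):
    """Center to mean 0 and scale to sample standard deviation 1,
    the same transform R's scale() applies by default."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical per-country values for one metric (NOT the real counts):
per_10k_pop = [1.2, 0.4, 3.1, 0.9]
scaled = scale(per_10k_pop)
# After scaling, all three metrics live on a comparable scale,
# so no single metric dominates the clustering distance.
```

Without this step, the per-area metric (with its huge range, see Monaco) would swamp the other two in any distance-based clustering.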

The scatter plot shows much more clearly how the two countries stand out, especially in the middle tiles below: passengers_per_1K_sq_miles vs. the other two metrics (passengers_per_10K_population and passengers_per_1Billion_gdp).

After removing those two countries from the data frame, clustering again gives the following:

That is an interesting cluster.  Among countries with the highest entries adjusted for population, GDP and geographic size, Hungary tops the list!  Maldives, Hong Kong, the UK and Malta take the remaining top-five places.  Quick normalized scores:

country           Score(/Pop.)   Score(/Area)   Score(/GDP)    Score_ABS
Hungary            5.783493976    1.560361327    4.485219257   11.82907456
Maldives           0.715814116    4.784567704    4.43908513     9.939466951
Hong Kong         -0.217141885    7.8493819     -0.59223565     8.658759434
United Kingdom     3.957774546    2.869764313    1.288187419    8.115726277
Malta              1.085016478    5.903919255    0.393610721    7.382546454
Bangladesh        -0.195758981    1.116466958    4.697494631    6.00972057

Cluster (optimal) size analysis:

It is always fun to slice and dice data in different ways, and the bubble world map below shows a simple metric: passenger count per billion dollars of GDP.

The top 5 countries, in this case, are:

Bangladesh     133.95982
Hungary        128.75381
Maldives       127.62238
Philippines    125.95591
Kosovo         106.8

It would be more interesting to see how the numbers relate to each country’s science and technology budget.  I will try that in the next few days, as some of the data is already available in the wild.  In an ideal world, a good percentage of each yearly budget would be allocated to Science & Tech.

Data pull Python code:

# -*- coding: utf-8 -*-
# NOTE: Python 2 code (print statements, str.translate with deletechars).

import os
import re
import locale
import scraperwiki
from bs4 import BeautifulSoup
from collections import defaultdict


class NasaData():
    nasa_file_path = "/tmp/nasa_orion_reg_by_country.txt"
    ctry_file_path = "/tmp/countrycode_org_data.txt"
    nasa_site = ""
    ctry_site = ""
    metrics_file_path = "/tmp/nasa_metrics_by_country.txt"

    @staticmethod
    def get_nasa_entries():
        """Scrape NASA Orion participant counts by country.
        Output to file nasa_orion_reg_by_country.txt
        Args: None
        """
        out_file = NasaData.nasa_file_path
        if os.path.exists( out_file ) and os.path.getsize( out_file ) > 10:
            print "Warning: " + out_file + " exists. Continuing without scraping NASA data.\n"
            return False

        html = scraperwiki.scrape( NasaData.nasa_site )
        soup = BeautifulSoup( html )

        countries = soup.find( 'ul', class_='countryList' )
        with open( out_file, 'wt' ) as fh:
            for country in countries.findAll('li'):
                c_name = country.find('div', class_='countryName').text
                c_num = country.find('div', class_='countNumber').text.strip()
                line = ''.join([c_name, ',', c_num, '\n'])
                fh.write( line )

        return True

    @staticmethod
    def get_country_details():
        """Scrape countrycode.org data including population, gdp, area, etc.
        Dump output to file countrycode_org_data.txt
        Args: None
        """
        out_file = NasaData.ctry_file_path
        if os.path.exists( out_file ) and os.path.getsize( out_file ) > 10:
            print "Warning: " + out_file + " exists. Continuing without scraping COUNTRY_CODE data.\n"
            return False

        html = scraperwiki.scrape( NasaData.ctry_site )
        soup = BeautifulSoup( html )

        cnty_table = soup.find( lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == "main_table_blue" )
        countries = cnty_table.findAll( lambda tag: tag.name == 'tr' )
        with open( out_file, 'wt' ) as fh:
            for country in countries:
                cnty_str = '|'

                for attr in country.findAll( lambda tag: tag.name == 'th' ):
                    cnty_str += attr.contents[0] + "|"
                for ix, val in enumerate( country.findAll( lambda tag: tag.name == 'td' ) ):
                    if ix == 0:
                        cnty_str += val.findAll( lambda tag: tag.name == 'a' )[0].string + "|"  # Country name
                    else:
                        cnty_str += val.contents[0].strip() + "|"  # Country attrs

                fh.write( cnty_str + "\n" )

        return True

    @staticmethod
    def join_country_data():
        """Join the two data sets by country name and write the country names
        and their metrics to file nasa_metrics_by_country.txt
        Args: None
        """
        # Country names lowercased, leading "The " removed, extra spaces collapsed
        nasa_data = defaultdict(list)

        for line in open( NasaData.nasa_file_path, 'rt' ):
            ln_els = line.strip('\n').split(',')
            ln_els[0] = ln_els[0].lower()
            ln_els[0] = re.sub(r'(^[Tt]he\s+)', '', ln_els[0])
            ln_els[0] = re.sub(r'(\s{2,})', ' ', ln_els[0])
            nasa_data[ln_els[0]].append(ln_els[1])  # orion_vote appended

        # nasa_data dict appended with country data. key:country => values[orion_votes, pop., area, gdp]
        for l_num, line in enumerate( open( NasaData.ctry_file_path, 'rt') ):
            # line: |Afghanistan|AF / AFG|93|28,396,000|652,230|22.27 Billion|
            if l_num == 0: continue  # Skip header

            ln_els = line.strip('\n').split('|')
            ln_els[1] = ln_els[1].lower()
            ln_els[1] = re.sub(r'(^[Tt]he\s+)', '', ln_els[1])
            ln_els[1] = re.sub(r'(\s{2,})', ' ', ln_els[1])

            # Strip out commas in pop (element 4) and area (5)
            nasa_data[ln_els[1]].append( ln_els[4].translate(None, ',') )  # pop appended
            nasa_data[ln_els[1]].append( ln_els[5].translate(None, ',') )  # area appended

            # Normalize gdp to millions
            gdp = float( re.match( r'(\d+\.?\d*)', ln_els[6] ).group(0) )
            if re.search( r'(Billion)', ln_els[6], re.I ):
                gdp = gdp * 1000
            elif re.search( r'(Trillion)', ln_els[6], re.I ):
                gdp = gdp * 1000000
            nasa_data[ln_els[1]].append( gdp )  # gdp appended

        # TODO: Some country names are not standard in NASA data. Example: French Guiana is either Guiana or Guyana.
        # Delete what is not found in country code data or match countries with hard coded values.

        locale.setlocale(locale.LC_ALL, '')
        with open( NasaData.metrics_file_path, 'wt' ) as fh:
            for cn in sorted(nasa_data):  # country name
                # array has nasa_votes, pop., sq miles, gdp and has pop > 0 and gdp > 0. Capitalize name.
                if len(nasa_data[cn]) > 3 and int(nasa_data[cn][1]) > 0 and nasa_data[cn][3] > 0:
                    l = ( cn.title() + ":" + nasa_data[cn][0]
                        + ":" + locale.format( '%d', int(nasa_data[cn][1]), 1 )  # pop
                        + ":" + str( round( float(nasa_data[cn][0]) * 10000 / int(nasa_data[cn][1]), 5 ))  # per 10K pop
                        + ":" + locale.format( '%d', int(nasa_data[cn][2]), 1 )  # area
                        + ":" + str( round( float(nasa_data[cn][0]) * 1000 / int(nasa_data[cn][2]), 5 ))  # per 1K sq mile
                        + ":" + locale.format( '%d', int(nasa_data[cn][3]), 1 )  # gdp (millions)
                        + ":" + str( round( float(nasa_data[cn][0]) * 1000 / nasa_data[cn][3], 5 ))  # per Billion $ gdp
                        + "\n" )
                    fh.write( l )

        return True


if __name__ == "__main__":
    NasaData.get_nasa_entries()
    NasaData.get_country_details()
    NasaData.join_country_data()
    exit( 0 )

World Cup Data Analysis For Fun – Part II

Continuing from Part I ( ), the following chart shows the density of the number of goals scored by a country in a world cup tournament.  The black line in the foreground is the average density of goals.

Some interesting facts:
* The purple peak is Wales, with four goals in 1958, the only year they played.
* The orange-yellowish peak is Bolivia, scoring no goals twice and one goal once.
* A large percentage (~80%) of countries score no more than 10 goals in a tournament.

Goals For Summary (per country per cup):

  • Min.    :  0.0
  • 1st Qu. :  2.0
  • Median  :  4.0
  • Mean    :  5.7
  • 3rd Qu. :  8.0
  • Max.    : 27.0

Goals Against Summary (per country per cup):

  • Min.    :  0.0
  • 1st Qu. :  4.0
  • Median  :  5.0
  • Mean    :  5.7
  • 3rd Qu. :  7.0
  • Max.    : 17.0
While the number of goals scored in each world cup is low (see chart above), it is also interesting to see the trend over many decades of all goals (scored + allowed) per game.  Here I applied LOWESS (locally weighted scatterplot smoothing), a non-parametric regression, to better fit the data (blue line).
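For the mechanics of LOWESS, here is a minimal numpy sketch of the idea: at each point, fit a straight line by weighted least squares over a neighborhood, with tricube weights that fade out toward the edge of the window.  The actual chart was produced with R’s built-in smoother; this toy version (with made-up data) only illustrates the technique.

```python
import numpy as np

def lowess_sketch(x, y, frac=0.5):
    """For each x[i], fit a line by weighted least squares over the
    nearest frac*n points, with tricube weights; return fitted values."""
    n = len(x)
    k = max(3, int(frac * n))
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]             # k nearest neighbors of x[i]
        h = d[idx].max()                    # local bandwidth
        if h == 0:
            h = 1.0
        w = (1.0 - (d[idx] / h) ** 3) ** 3  # tricube kernel weights
        sw = np.sqrt(w)
        A = np.column_stack([np.ones(k), x[idx]])
        # weighted least squares via row-scaling by sqrt(weights)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0                          # exactly linear toy data
smoothed = lowess_sketch(x, y, frac=0.4)   # should recover the line
```

Production implementations (R’s lowess(), statsmodels) add iterative robustness weights on top of this, which is what makes the fit resistant to outlier years.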


Though there were a lot more goals per game in the early years, in the recent past (after 1970) it has stabilized around 2.7 goals per game.  But how do the soccer powerhouses (Argentina, Brazil, Germany, etc.) compare with seven other countries chosen from another cluster (see Part I)?  As one would expect, you have to score more than you allow 🙂, represented by the gray dashed line on the Y-axis, i.e.,

Goals Scored / Goals Allowed > 1

The colored year shows the winner of the World Cup that year, while the size of the bubble shows the total goals (scored plus allowed).  Six countries won all the world cups between 1930 and 2006, except for 1930 and 1950 when Uruguay won; there were no world cups in 1942 and 1946.

The outlier at the top left (BR, 10) is from 1986, when Brazil scored 10 goals but allowed only 1 in 5 matches, while Argentina won the world cup that year, scoring 14 goals and allowing 5 in 7 matches.

And the big dot at the bottom (US, 0.14) is from 1934, when the US scored 1 goal and allowed 7.
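The plotted labels are just the scored/allowed ratios, so the two call-outs above can be sanity-checked with one line of arithmetic each:

```python
# The plotted value for each country is goals scored divided by goals allowed.
def goal_ratio(scored, allowed):
    return round(scored / allowed, 2)

brazil_1986 = goal_ratio(10, 1)  # the (BR, 10) outlier at the top
us_1934 = goal_ratio(1, 7)       # the (US, 0.14) dot at the bottom
```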

World Cup Data Analysis For Fun – Part I

With the world cup fever of 2014 around, it is interesting to do some analysis and dig deeper into the stats.  Here is an attempt over a weekend.

I pulled some publicly available data on all world cups from 1930 to 2006, and after cleaning it up for my purposes it had the following fields for each match/game:
Country, Year, FIFA_Winner, Country_Code,
Goals_For, Goals_Against, Matches, Penalties,
Won, Drawn, Lost, Corners, Offsides,
Shots_On_Goal, Free_Kicks, etc.

My first attempt was to look at how the countries cluster together; it would also be easy to validate the clustering against some prior knowledge of the world cup.  For example, one would expect Brazil, Germany, Argentina and a few others to cluster together.

As in any statistical analysis, it is a bit of a challenge to decide how to handle missing values.  In the above data, fields like “Shots on Goal, Shots Wide, Free Kicks, Corners” were not available until 2002.  These values can either be set to 0 or filled with the mean of the available data (over the available period) with a function like

mean_vec <- function(vec) {
    m <- mean(vec, na.rm = TRUE)
    vec[is.na(vec)] <- m
    vec
}

where you replace ‘NA’ with the mean.  It can be applied either column-wise or row-wise through the apply function.  Note that this is the grand mean of each column, which introduces its own errors into the model.  Better would be a mean at the country level (simple, straightforward, and works well for data with Gaussian distribution characteristics), or other techniques including regression substitution, most-probable-value substitution, etc.  For some more details see
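The same column-wise grand-mean fill can be written with numpy, assuming the missing entries are stored as NaN; the sample matrix below is made up for illustration.

```python
import numpy as np

def impute_col_means(X):
    """Replace each NaN with the mean of its column's observed values,
    i.e. the grand-mean fill of applying mean_vec column-wise in R."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)   # per-column mean, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]     # fill each NaN with its column mean
    return X

# Hypothetical matrix: second column (say, Corners) missing for an early year
X = np.array([[3.0, np.nan],
              [5.0, 4.0],
              [4.0, 6.0]])
filled = impute_col_means(X)            # the NaN becomes (4 + 6) / 2 = 5.0
```

A per-country version would group the rows by country and take np.nanmean within each group instead of over the whole column.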

Running the sum-of-squared-errors (SSE) analysis yielded the chart below.  With the elbow/bend between 4 and 6, a minimum of 4 clusters would be sufficient.  I chose 10 for the analysis below.
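The SSE curve behind an elbow chart comes from running k-means for each candidate k and recording the within-cluster SSE.  The real analysis used R, so the following numpy sketch of Lloyd’s algorithm (with a simple deterministic initialization and toy data) is only meant to show where those numbers come from.

```python
import numpy as np

def kmeans_sse(X, k, iters=50):
    """Minimal Lloyd's k-means with a deterministic init (first k rows
    as starting centroids); returns within-cluster sum of squared errors."""
    centroids = X[:k].astype(float).copy()
    for _ in range(iters):
        # squared distance of every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Three well-separated toy clusters, interleaved so the first three rows
# seed three different clusters:
X = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0],
              [0.0, 1.0], [10.0, 11.0], [20.0, 1.0]])
sse = {k: kmeans_sse(X, k) for k in (1, 2, 3)}
# plot k vs. sse[k] and look for the bend ("elbow") in the curve
```

R’s kmeans() does the same thing with better initialization (multiple random restarts), which matters on real, less separated data.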

With 10 clusters, it resulted in the following dendrogram:

How do the soccer powerhouses like Brazil, Germany and a few others (cluster 7 from the left in the above diagram) compare with the rest?  One metric is how many goals they score in each match while allowing some.  Density plots are one visualization: here I plotted a 3-dimensional density with “Goals For” on the X-axis and “Goals Against” on the Y-axis.  I left Sweden off the list for now.  There is a twin peak at 1 and 2 goals in favor, with ~0.5 goals against per game.  Contrast this with the other countries below.

Comparing with the 7 other countries from the last cluster (#10 in the above dendrogram), I get a different density plot, where the peak happens at ~0.6 goals in favor and ~2 goals against per game.

PS: Note the difference in scales between these two plots.  It would be interesting to superimpose one above the other with the same scale along all 3 dimensions.

A heat map is another visualization with more details, including the deviation of each variable (represented by the light blue vertical lines below).  Compare “Games Lost and Goals Against” with “Games Won and Goals For” for the two clusters.  Also Shots on Goal.

More (part II) analysis at:

Users Password Analysis

As a data engineer, it is always interesting to work on large, unique data sets.  With the recently released Yahoo user details (453K), much insightful info can be gleaned from the data.  For example, even though password hacking has been well known for a long time, a large number of users still use simple passwords, sometimes as simple as “password” or “123456” or similar.  Here are the top 10 passwords and the number of users who used them!

123456       1667
password      780
welcome       437
ninja         333
abc123        250
123456789     222
12345678      208
sunshine      205
princess      202
qwerty        172

It is interesting to see how many users had a unique password, not used by anyone else in this data set.  There were 10.6K users with no password at all, which might be due to a data issue; they were ignored for many of the calculations.  Only ~304K (69%) of users had unique passwords.

Another interesting insight: if a password is used by more than one user, there is a good likelihood that it is some Latin word or words (“whatever”, “iloveyou”), a proper name (“jordon”, “ginger”), a number (123321), or something easily guessed (for example, “q1w2e3r4” for a qwerty keyboard, or “asdfgh”, etc.).  Even when just two users shared a password, there was some certainty that it is a guessable password!  With each additional user, the certainty increases quite quickly.  Under these circumstances, even if a password is hashed (with MD5 or SHA or other algorithms) by the service provider, a brute-force application can recover the password for these users.
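Counting shared passwords like this is straightforward with collections.Counter; the sample list below is hypothetical, standing in for the leaked plaintext data.

```python
from collections import Counter

# Hypothetical sample standing in for the leaked plaintext passwords:
passwords = ["123456", "password", "123456", "sunshine",
             "123456", "password", "tr0ub4dor&3"]

counts = Counter(passwords)
# Passwords used by more than one user: the likely-guessable candidates
shared = {p: n for p, n in counts.items() if n > 1}
# Fraction of users whose password is unique in the data set
unique_share = sum(1 for n in counts.values() if n == 1) / len(passwords)
```

Running the same two lines over the full 453K-row dump is what produces the top-10 table and the 69% unique-password figure above.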

Looking at how users from different email service providers set up their passwords showed the following.  As expected, Yahoo had more users (X-axis), while smaller companies (“others” in the chart) had a larger share of users (71.7%) with unique passwords.  At the same time, Gmail and Live users’ average password length is above 8.87 characters.  The length of the passwords is represented by the size of the bubble.

A bigger bubble that sits higher up the Y-axis is better, as it represents more users using unique passwords with longer password strings.  See the table below for more details.

Even more interesting analysis can be done, including looking for people’s or places’ names in passwords.  One could use the popular names from the US Social Security Administration, whose name lists go back as far as 1880!  There were a lot of passwords that simply used these names!  Many more matches can be found with minor modifications like changing i to 1 or o to 0 (zero), etc.
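A sketch of that matching idea: normalize the common character substitutions back to letters, then look the result up in a name set.  The names and passwords below are hypothetical stand-ins for the SSA list and the leaked data.

```python
def normalize_leet(password):
    """Undo common substitutions (1->i, 0->o, 3->e, ...) so that
    'm1chael' can match the name 'michael'."""
    table = str.maketrans({"1": "i", "0": "o", "3": "e", "$": "s", "@": "a"})
    return password.lower().translate(table)

# A few entries standing in for the SSA popular-names list (hypothetical):
ssa_names = {"michael", "jennifer", "jessica", "david"}

def is_name_based(password):
    return normalize_leet(password) in ssa_names

hits = [p for p in ["m1chael", "Jess1ca", "qwerty", "dav1d"]
        if is_name_based(p)]
```

Since the name list is a set, each lookup is O(1), so scanning hundreds of thousands of passwords against 130+ years of names stays cheap.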

With so many users using simple passwords, service providers and websites should force each user to have a stronger password by enforcing rules during registration or at each login.  Users should also be required to change their passwords every few months.  It might be even better if each computer were equipped with a fingerprint or iris reader for user authentication, thus avoiding this whole password mess.

Visualizing daily metric with tiles

One of the effective ways to present time series data over a long period is the typical line chart or some modified version of it.  It gets a little harder when you would also like to see clustering of data points.  Here is one case where I find a tiled series gives a quick glimpse of a metric varying over a few years.  This is a daily metric starting January 01, 2010 and running to a recent week of May, 2012: nearly 2 1/2 years of metric variation.

The Y-axis represents the day of the week (wday), with 1 representing Monday, 2 Tuesday and so forth, with 7 as Sunday.  I set the values this way so that I could cluster the weekend metrics (5, 6 & 7) together versus the weekdays (1, 2, 3, 4: Monday to Thursday).  The X-axis represents the week of the year (1 to 52 or 53).  Years 2010, 2011 and 2012 are the series.  The metric varies from 0.00 to 10.00, and the color of each tile varies slightly based on the metric value.
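The (week, wday) tile coordinates described above map directly onto Python’s ISO calendar support, since the ISO weekday already runs 1 (Monday) through 7 (Sunday).  The original charts were built in R; this is just a small sketch of the coordinate mapping.

```python
from datetime import date

def tile_coords(d):
    """Map a date to its tile position: (year, week-of-year, day-of-week).
    isocalendar() gives ISO week 1..52/53 and weekday 1 (Mon) .. 7 (Sun),
    matching the Y-axis layout described above."""
    year, week, wday = d.isocalendar()
    return year, week, wday

def is_weekend_band(d):
    """Rows 5, 6, 7 (Fri, Sat, Sun) form the weekend band of tiles."""
    return tile_coords(d)[2] >= 5
```

One caveat: the ISO year can differ from the calendar year for the first and last few days of a year, which is worth handling when splitting the tiles into the 2010/2011/2012 series.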

If you were to use discrete metric values, say 0 to 5, the color coding is similarly quite distinct.  See Graph 2 below.  That data runs from the 2nd week of March, 2010 to May, 2012.

Graph 1
Graph 2