Visualizing data processed consciously in a moment

Recently Google released training material to its employees on how unconscious bias can happen in the workplace.  For more details see – , which refers to the research paper at

It is very interesting, and one of the slides (see below) specifically caught my attention with respect to how huge an amount of data is NOT processed consciously.  That is, we can handle only 40 bits out of every 11 million bits per second!

That is just 1 part in every 275,000 parts!!

As interesting as that is, I thought the impact would be even more dramatic when visualized in some way.  Following are a couple of attempts at it using Python and ImageMagick.

Each tiny square (green or red) in the following picture represents 40 bits, and each row has 275 of those squares. With 1,000 rows (10 blocks of 100 rows each) we get 11 million bits of data represented.  This is just for one second of duration!

Just imagine the scale for one’s lifetime (80 years) of data. All those 275,000 squares above would have to be repeated 2,522,880,000 times!!!
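The arithmetic behind the grid picture and the lifetime figure can be checked in a few lines of Python, using the layout numbers described above:

```python
# Grid layout described above: 275 squares per row, 1,000 rows,
# 40 bits per square.
bits_per_square = 40
squares_per_row = 275
rows = 1000
total_bits = bits_per_square * squares_per_row * rows
print(total_bits)  # 11,000,000 bits -- one second of sensory input

# Seconds in an 80-year lifetime (ignoring leap years)
lifetime_seconds = 80 * 365 * 24 * 3600
print(lifetime_seconds)  # 2,522,880,000
```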

Another way: a tiny string, say 5 pixels long, represents 40 bits; the total length for 11M bits will then be 275K * 5 pixels!  See below.  The small green line at the center of the diagram is a string of a few pixels, around which the really long string, i.e., one 275K times longer, is wound around a square peg. Every 25 rounds the string changes color for better visibility. In total there are 742 sides.

at the end of the 25th round the length is 5,050 times longer,
at the end of the 50th round it is 20,100 times,
at the end of the 100th round it is 80,200 times,
at the end of the 150th round it is 180,300 times,
and finally at 185.25 rounds it is 274,911 times longer

Note: Keep in mind it is not the size (area) of the squares but the total perimeter length of all these squares that represents the ratio of data.
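The winding lengths above follow from a simple pattern: side k of the square spiral is k units long (one unit being the 5-pixel base string), and each round adds 4 sides. A small Python sketch under that assumption:

```python
def length_after_sides(n):
    """Total string length, in 40-bit units, after n complete sides
    of the square spiral (side k is taken to be k units long)."""
    return n * (n + 1) // 2

# 4 sides per round: check the lengths after 25, 50, 100, 150 rounds
for rounds in (25, 50, 100, 150):
    print(rounds, length_after_sides(4 * rounds))
# -> 5050, 20100, 80200, 180300, matching the figures above

# the full 11M bits need 11,000,000 / 40 = 275,000 units of string
print(11_000_000 // 40)
```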

World Cup Data Analysis For Fun – Part II

Continuing from Part I ( ), the following chart shows the density of the number of goals scored by a country in a world cup tournament.  The black line in the foreground is the average density of goals.

Some interesting facts:
* The purple peak is Wales, with four goals in 1958, the only year they played.
* The orange-yellowish peak is Bolivia, scoring no goals twice and one goal once.
* A large percentage (~80%) of countries score no more than 10 goals in each tournament.

Goals For Summary (per country per cup):

  • Min.   :  0.0
  • 1st Qu.:  2.0
  • Median :  4.0
  • Mean   :  5.7
  • 3rd Qu.:  8.0
  • Max.   : 27.0

Goals Against Summary (per country per cup):

  • Min.   :  0.0
  • 1st Qu.:  4.0
  • Median :  5.0
  • Mean   :  5.7
  • 3rd Qu.:  7.0
  • Max.   : 17.0
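The summaries above follow the shape of R's summary() output; an equivalent can be sketched in Python with the standard statistics module (the sample values here are illustrative, not the real tournament data):

```python
import statistics

def summary(values):
    """Min, quartiles, mean, max -- similar to R's summary() output."""
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return {'Min.': min(values), '1st Qu.': q1, 'Median': median,
            'Mean': round(statistics.mean(values), 1),
            '3rd Qu.': q3, 'Max.': max(values)}

# illustrative goals-for values for a handful of countries
print(summary([0, 2, 4, 8, 27]))
```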
While the number of goals scored in each world cup is low (see chart above), it is also interesting to see the trend over many decades of all goals (scored + allowed) per game.  Here I applied LOWESS (locally weighted scatterplot smoothing) non-parametric regression to better fit the data (blue line).
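For the curious, LOWESS fits a small weighted linear regression around every point. Below is a minimal pure-Python sketch (one pass, tricube weights, no robustness iterations); a real analysis would use a library implementation such as the lowess function in statsmodels:

```python
def lowess(xs, ys, frac=0.5):
    """Minimal one-pass LOWESS: for each x, fit a tricube-weighted
    linear regression over roughly the nearest frac*n points."""
    n = len(xs)
    k = max(2, int(frac * n))
    smoothed = []
    for x in xs:
        dists = sorted(abs(x - xi) for xi in xs)
        h = dists[k - 1] or 1.0  # bandwidth: distance to k-th nearest point
        w = [max(0.0, 1 - (abs(x - xi) / h) ** 3) ** 3 for xi in xs]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, xs)) / sw
        my = sum(wi * yi for wi, yi in zip(w, ys)) / sw
        b_num = sum(wi * (xi - mx) * (yi - my)
                    for wi, xi, yi in zip(w, xs, ys))
        b_den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, xs))
        b = b_num / b_den if b_den else 0.0
        smoothed.append(my + b * (x - mx))
    return smoothed

# illustrative goals-per-game points, not the real world cup numbers
years = [1930, 1950, 1970, 1990, 2006]
gpg = [3.9, 4.0, 3.0, 2.2, 2.3]
print([round(v, 2) for v in lowess(years, gpg, frac=0.8)])
```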


Though in the early years there were a lot more goals per game, in the recent past (after 1970) it has stabilized around 2.7 goals per game.  But how do the soccer powerhouses (Argentina, Brazil, Germany, etc.) compare with seven other countries chosen from another cluster (see Part I)?  As one would expect, you have to score more than you allow 🙂, which is represented by the gray dashed line on the Y-axis, i.e.,

Goals Scored / Goals Allowed > 1

The colored year shows the winner of the World Cup in that year, while the size of the bubble shows the total goals (scored plus allowed).  Six countries won all the world cups between 1930 and 2006, except for 1930 and 1950 when Uruguay won; there were no world cups in 1942 and 1946.

The outlier you see at the top left (BR, 10) is from 1986, when Brazil scored 10 goals but allowed only 1 in 5 matches, while Argentina was the world cup winner, scoring 14 goals and allowing 5 in 7 matches.

And the big dot at the bottom (US, 0.14) is from 1934, when the US scored 1 goal and allowed 7.
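The two outliers can be read straight off the scored/allowed ratio, using the figures quoted above:

```python
def goal_ratio(scored, allowed):
    """Goals scored / goals allowed; > 1 means a net positive record."""
    return scored / allowed

print(goal_ratio(10, 1))           # Brazil 1986 -> 10.0
print(round(goal_ratio(1, 7), 2))  # US 1934 -> 0.14
```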

Using d3 visualization for fraud detection and trending

d3 (Data-Driven Documents) is a great data-visualization tool, and I recently used it to track possible fraud and to see how some metrics behaved over periods from a few hours to a few weeks.  You can filter a group of interest out of, say, a million or more users and then use d3 to manually work through a smaller, more unique set.

The graph is dynamic (unlike in this blog): you can select a range of lines by moving the cursor over any of the Y-axes and selecting them.  It is much more interesting in action than the snapshot images below.

Below is users’ “A” metric over the last 5 hours, 5 to 12 hours, 12 hours to 1 day, and so on up to the last 14 to 28 days.  The user group was selected from a few different segments, and each major color corresponds to a segment.  d3 automatically varies the color slightly for each line.  Though it is not very clear in the image below, the main colors are blue, green, orange and purple.

The input to d3 is a simple CSV file, and all the presentation is handled by d3, unlike many packages I had used previously, where I ended up creating an output file in HTML or some XML for Flash.  A big advantage of d3 over these is that you attach the HTML element to each data point in the code, and the built-in data-visualization functions do the rest of the magic.
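As an example of how small that input is, here is a sketch of writing such a CSV with Python's csv module; the column names and values are made up for illustration, not the real schema:

```python
import csv
import io

# Hypothetical per-user metric values for each time bucket
rows = [
    {'user_id': 'u1', 'segment': 'blue',  '5h': 1850, '12h': 40,  '1d': 35},
    {'user_id': 'u2', 'segment': 'green', '5h': 120,  '12h': 150, '1d': 90},
]

buf = io.StringIO()  # a real run would use open('metrics.csv', 'w') instead
writer = csv.DictWriter(buf, fieldnames=['user_id', 'segment', '5h', '12h', '1d'])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```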

In the above scenario, for example, you can move the cursor to the left-most scale (5 hours) and zoom in on lines between 1,700 and 2,000.  There is only one user well above the rest, who have metric values of 200 or lower.  This user hasn’t done much in the last 4 weeks until the last 5 hours!  Time to look into what this user is doing; we use other tools for further analysis.

Similar to the above scenario, below is another graph where I am interested in all users whose score was between 600 and 1,400 over the last 2 to 7 days.  There is not much exciting in the graph below, though I have seen more interesting details at other times.

Happy data visualization!

User traversal digraphs

Visualizing users’ traversal of a website, or the path traversed in an application, or the nodes visited, may need to be analyzed for better user experience, for improving the efficiency of the traversed path, or for improving the conversion rate by getting users to the end state. For example, it may be that users visiting one set of nodes (a node being a particular state of the application, like a tutorial state machine) convert better than users going through a different set of states. There are many cases where a graph or directed graph is a necessity.

You can use freeware tools like Graphviz’s ‘dot’ to draw directed graphs. In the following example, I assume a state/node is represented by a simple integer and the path taken by an edge (directed line).

Here I show a simplified version of a project I worked on, quickly generating a digraph that can be sent or uploaded to a site for internal business users at regular intervals in real time. A few assumptions are made to help explain the main idea: each visit to a node is timestamped, and a user can only traverse forward (meaning visit a node with a number higher than the current one). To start with, the data is summarized in a table with a unique identifier (user, device, etc.), the node visited, and the time of visit.

Order the data chronologically for each unique id so that, by doing a self-join in SQL (see code snippet below), we can simply find the next node the user visited.

SELECT T1.node_id AS current_node, T2.node_id AS next_node, COUNT(1) AS ct
FROM table T1 JOIN table T2 ON T1.row_id + 1 = T2.row_id
WHERE T1.unique_id = T2.unique_id  -- say, user_id
GROUP BY T1.node_id, T2.node_id

You can also assign a weight/cost to each edge by normalizing the count, which will result in a set of rows similar to
      current_node, next_node, count, count/max_count

This is all the data we need to generate the input file for the ‘dot’ application. Write a program that takes the above input and dumps it into a file with content like –

Digraph G {
   # splines=false;

   2 -> 3 [penwidth=4.9, label="1190"]
   3 -> 4 [penwidth=4.9, label="1150"]
   4 -> 5 [penwidth=4.8]
   5 -> 6 [penwidth=4.8]
   6 -> 7 [penwidth=4.8]
   7 -> 8 [penwidth=4.8]
}

By providing this as input you can generate the output in multiple formats, including pdf, ps, etc.  See graph 1 below.  You can provide more input parameters in the file to fancy up the graph or to add more information, like the drop-off (% of users dropped) between states; see graph 2.  In essence you are generating a digraph to visualize the data in a more meaningful way.
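Such a program can be sketched in a few lines of Python, taking the (current_node, next_node, count) rows from the SQL above and scaling penwidth by count/max_count; the counts here are illustrative:

```python
# (current_node, next_node, count) rows, as produced by the SQL self-join
edges = [(2, 3, 1190), (3, 4, 1150), (4, 5, 1100)]
max_count = max(ct for _, _, ct in edges)

lines = ['Digraph G {']
for cur, nxt, ct in edges:
    width = round(5 * ct / max_count, 1)  # normalize count to a pen width
    lines.append(f'   {cur} -> {nxt} [penwidth={width}, label="{ct}"]')
lines.append('}')

with open('graph.dot', 'w') as f:  # then e.g.: dot -Tpdf graph.dot -o graph.pdf
    f.write('\n'.join(lines))
```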

Digraph 1 – with sample weights between states 2 & 3, 3 & 4

With dot input file content like

  subgraph cluster1 {
    2 -> 3   [penwidth="23", label="23.48%", taillabel="4450", headlabel="3405"]
    3 -> 4   [penwidth="25", label="24.9%",  taillabel="3405", headlabel="2557"]
    4 -> 5   [penwidth="18", label="18.34%", taillabel="2557", headlabel="2088"]
    5 -> 6   [penwidth="19", label="19.3%",  taillabel="2088", headlabel="1685"]
    6 -> 7   [penwidth="20", label="20.18%", taillabel="1685", headlabel="1345"]
    7 -> 8   [penwidth="26", label="26.47%", taillabel="1345", headlabel="989"]
    8 -> 9   [penwidth="35", label="35.29%", taillabel="989",  headlabel="640"]
    9 -> 10  [penwidth="39", label="38.59%", taillabel="640",  headlabel="393"]
    10 -> 11 [penwidth="36", label="35.88%", taillabel="393",  headlabel="252"]
  }

Digraph 2 – with users drop-off between states in %

Conversation Prism – An Image

As social media, social networking, advertising, and Internet marketing continue to evolve with new technologies, and many companies create their own social groups, it all gets more complex and confusing.  Many a time a picture or an image explains something more elegantly than 1,000 words or more, and in some cases an image is the best-suited tool to explain.  Here is one image, created by Brian Solis & Jesse Thomas, that I like in this conversation!

A few other images that will help.

From Dave Fleet