Saturday, May 12, 2012

My own version of bubble plot (part 1)

During one of my projects, I found myself in need of visualizing more than 3 dimensions at once. Three-dimensional graphs are not a good solution, usually - they will need to be properly oriented, for a start, ad that's tricky.
So, I started looking at bubble plots. The size of the bubble can show one property, as illustrated by the nice post at FlowingData - then you can show one more property defined by a color scale (continuous below, but nothing stops it from being categorical) 

I decided to push it and have two properties: look at the example below - the redder the color, the higher the value on the property ApKUpt (or whatever you want). The greener, the higher ApVUpt. I moved the color legend to a square on the extreme right to achieve a better use of the available space.


I tried three colors but it turns out that it just doesn't work. Even when your eyes don't interpret every rgb triplet as a completely different color, the amount of redness, greeness or blueness is difficult to estimate. Also, it gets tricky to show the color grading in a legend... One has to resort to slices of the three-dimensional color space. See what I mean?

Of course, one can define an ad-hoc color scale, such as the one used below, vaguely inspired by the colors that Mathematica uses to paint its surfaces. Many thanks to my colleague Pär for teaching me how to define these kind of color scales, and much else.

Here follows the code for the one, two and three colors plot:

# bubble plot - takes in a dataframe with five different properties, for X, Y, Size, ColorA, B and C, plus additional parameters to control appearances of the plots produced
bubbles_in_2colors<-function( dataset, to_jitter=F, prop_names=NULL, titles=list(main='X vs Y', color='colors', size='size'), min_size = 1, max_size = 5, max_sat = 200, fav_col='green', alpha_default=0.7, ...) {
# NB: this code can handle a third primary color, but human eyes can't so for the moment I've forbidden that option.
if(exists('dataset')) {
if (is.null(prop_names)) {print(prop_names<-names(dataset))};
if (length(prop_names) ==4) {propX<-dataset[,1]; propY<-dataset[,2]; propS<-dataset[,3]; propCA<-dataset[,4]}
else if (length(prop_names) ==5) {propX<-dataset[,1]; propY<-dataset[,2]; propS<-dataset[,3]; propCA<-dataset[,4]; propCB<-dataset[,5]}
else if (length(prop_names) ==6) {propX<-dataset[,1]; propY<-dataset[,2]; propS<-dataset[,3]; propCA<-dataset[,4]; propCB<-dataset[,5]; propCC<-dataset[,6]; print("Three colors are too many for measly human eyes")}
else {stop(sprintf("I don't know what to do with an array with %d properties",length(prop_names)))}
}else {stop('did you actually give me a dataset to work with?')}
rangepropX<-range(propX); log_x=log_it(rangepropX); if (!exists('xlab')) {xlab=prop_names[1]}
rangepropY<-range(propY); log_y=log_it(rangepropY); if (!exists('ylab')) {ylab=prop_names[2]}
rangepropS<-range(propS); log_s=log_it(rangepropS);
# the line below restricts the sizes in case the span of values isn't too spread out.
Max2min<- rangepropS[2]/rangepropS[1]; if ((!log_s) & (Max2min<5)) {min_size=2; max_size=min_size*Max2min}
log_xy=paste(if(log_x){'x'}else{''},if(log_y){'y'}else{''},sep="");
rangepropCA<-range(propCA); log_cA=log_it(rangepropCA);
if (exists('propCB')) {rangepropCB<-range(propCB); log_cB=log_it(rangepropCB);} else{rangepropCB<-c(0,0); log_cB=F}
if (exists('propCC')) {rangepropCC<-range(propCC); log_cC=log_it(rangepropCC);} else{rangepropCC<-c(0,0); log_cC=F}
pch <- if (!exists('pch')) {21} else {pch} # 21 for circles, 22 for squares, 23 for diamonds, 24 for triangles up, 25 for triangles down
# the color palette is obtained by spreading the known palette (log)proportionally across the range of propC
# how about setting the color directly in rgb space?
if (!exists('palette_l')) {palette_l<-50}
colorpalette<-0:palette_l*(max_sat/255)/palette_l
print(colorpalette)
if(fav_col=='red') {
r_colpaltt<- colorpalette[1+invprop(propCA,rangepropCA[1],rangepropCA[2],log_cA)*(palette_l-1)]
g_colpaltt<- if (exists('propCB')) colorpalette[1+invprop(propCB,rangepropCB[1],rangepropCB[2],log_cB)*(palette_l-1)] else 0
b_colpaltt<- if (exists('propCC')) colorpalette[1+invprop(propCC,rangepropCC[1],rangepropCC[2],log_cC)*(palette_l-1)] else 0
}else if(fav_col=='green') {
g_colpaltt<- colorpalette[1+invprop(propCA,rangepropCA[1],rangepropCA[2],log_cA)*(palette_l-1)]
r_colpaltt<- if (exists('propCB')) colorpalette[1+invprop(propCB,rangepropCB[1],rangepropCB[2],log_cB)*(palette_l-1)] else 0
b_colpaltt<- if (exists('propCC')) colorpalette[1+invprop(propCC,rangepropCC[1],rangepropCC[2],log_cC)*(palette_l-1)] else 0
}else if(fav_col=='blue') {
b_colpaltt<- colorpalette[1+invprop(propCA,rangepropCA[1],rangepropCA[2],log_cA)*(palette_l-1)]
g_colpaltt<- if (exists('propCB')) colorpalette[1+invprop(propCB,rangepropCB[1],rangepropCB[2],log_cB)*(palette_l-1)] else 0
r_colpaltt<- if (exists('propCC')) colorpalette[1+invprop(propCC,rangepropCC[1],rangepropCC[2],log_cC)*(palette_l-1)] else 0
}
colpaltt<- rgb(r_colpaltt, g_colpaltt, b_colpaltt, alpha_default)
# print(colpaltt)
if (!length(colpaltt[!is.na(colpaltt)])) colpaltt[is.na(colpaltt)]<-'black'
# the size palette is obtained by (log)proportional representation of the propC range in the min_size-max_size range.
# minimum and maximum for printing symbols 0.5 to 1, or given by the user.
sizepaltt<-(prop(invprop(propS,rangepropS[1],rangepropS[2],log_s),min_size, max_size, log_s))
if (!length(sizepaltt[!is.na(sizepaltt)])) sizepaltt[is.na(sizepaltt)]<-1
# print("defined palettes"); print(colpaltt); print(sizepaltt)
if (to_jitter) {
jittered_X <- jitter_logalong(propX);
jittered_Y <- jitter_logalong(propY);
}else{
jittered_X <- (propX);
jittered_Y <- (propY);
}
# print("jittered X and Y")
# layout(matrix(c(1,1,1,2,1,1,1,3), 2, 4, byrow = TRUE)) # this works but let's try something fancier
if(!exists('propCB')) {layout(matrix(c(2,2,1,3), 2, 2, byrow = TRUE), widths=c(3,1), heights=c(1,3))}
else if(!exists('propCC')) {layout(matrix(c(1,2,1,3), 2, 2, byrow = TRUE), widths=c(3,1), heights=c(2,2))}
else {layout(matrix(c(1,2,1,3,1,4,1,5), 4, 2, byrow = TRUE), widths=c(4,1), heights=c(1,1,1,1))}
# print(sprintf("X from %G to %G, Y from %G to %G",rangepropX[1],rangepropX[2],rangepropY[1],rangepropY[2]))
plot(NULL, xlim=rangepropX, ylim=rangepropY, log=log_xy, bty='n', xaxt='n', yaxt='n', main=titles$main, xlab=prop_names[1], ylab=prop_names[2]); # , xaxs="i" to plot exactly within xlim,
# grid(nx = NULL, ny = NULL, col = "darkgray", lty = "dotted", lwd = 1, equilogs = T)
my_grid(nx = NULL, ny = NULL, xlims=rangepropX, ylims=rangepropY, col = "darkgray", lty = "dotted", lwd = 1, equilogs = T)
axXticks<-sort(unique(propX));
if (length(axXticks) > 3) {
if(log_it(axXticks)) {
axXticks<-within(log_pretty(axXticks, 1),axXticks,F, T)
}else{
axXticks<-within(pretty(axXticks),axXticks,F, T)
}
}
axis(1, at=axXticks, labels=sprintf("%5.3g",axXticks))
axYticks<-sort(unique(propY));
if (length(axYticks) > 3) {
if(log_it(axYticks)) {
axYticks<-within(log_pretty(axYticks, 1),axYticks,F, T)
}else{
axYticks<-within(pretty(axYticks),axYticks,F, T)
}
}
axis(2, at=axYticks, labels=sprintf("%5.3g",axYticks))
# Not yet working - attempt to reorder points so that the small ones are plotted above the little ones
# for (n in 1:length(propS)) {indx<-c(indx,which(propS==sort(propS)[n]))}
# indx<-rev(indx)
# jittered_X<-jittered_X[indx]
# jittered_Y<-jittered_Y[indx]
# propS<-propS[indx]
# propC<-propC[indx]
points(jittered_X, jittered_Y, , pch=21, cex=sizepaltt, col="white", bg=colpaltt);
if (length(propX) < 10) {text(x=jittered_X, y=jittered_Y, labels=dots_names, pos=4)}
# now it's the time to print a nice colorspace as legend for the bubbles
# the colorbar is a second plot, going at the top right (see layout(matrix()) command, much above)
# I'll have two different colorspaces depending on whether two or three dimensions are being used as colors
if(!exists('propCB')) {
plot(NULL, xlim=range(propCA), ylim=c(0,1), log=if(log_cA) 'x' else '', xaxt='n', yaxt='n', xlab=prop_names[4], ylab=prop_names[5], bty='n', main=titles$color);
for (n in 1:palette_l) {
if (fav_col == 'red') rgbcol<-rgb(colorpalette[n], 0, 0, alpha_default)
else if (fav_col == 'green') rgbcol<-rgb(0, colorpalette[n], 0, alpha_default)
else if (fav_col == 'blue') rgbcol<-rgb(0, 0, colorpalette[n], alpha_default)
rect(xleft= prop(x=(n-1)/(palette_l),rangepropCA[1],rangepropCA[2],log_cA)
,xright= prop(x=(n)/(palette_l),rangepropCA[1],rangepropCA[2],log_cA)
,ytop= 1
,ybottom= 0
, col=rgbcol
, border="transparent"
)
}
ycolorbarlegend<- 0.5
colorticks<- if (log_cA) within(log_pretty(rangepropCA, 1),rangepropCA, F, F) else within(pretty(rangepropCA),rangepropCA, F)
axis(1, at=colorticks, labels=sprintf("%5.3g",colorticks))
}else if(!exists('propCC')) {
plot(NULL, xlim=range(propCA), ylim=range(propCB), log=paste(if(log_cA) 'x' else '',if(log_cB) 'y' else '', sep=''), xaxt='n', yaxt='n', xlab=prop_names[4], ylab=prop_names[5], bty='n', main=titles$color);
for (m in 1:palette_l) {
for (n in 1:palette_l) {
rgbcol<-rgb(colorpalette[n],colorpalette[m],0, alpha_default);
rect(xleft= prop(x=(n-1)/(palette_l),rangepropCA[1],rangepropCA[2],log_cA)
,xright= prop(x=(n)/(palette_l),rangepropCA[1],rangepropCA[2],log_cA)
,ytop= prop(x=(m-1)/(palette_l),rangepropCB[1],rangepropCB[2],log_cB)
,ybottom= prop(x=(m)/(palette_l),rangepropCB[1],rangepropCB[2],log_cB)
,col=rgbcol
,border="transparent"
)
}
}
colorAticks<- if (log_cA) within(log_pretty(rangepropCA, 1),rangepropCA, F, T) else within(pretty(rangepropCA),rangepropCA, F, T)
axis(1, at=colorAticks, labels=sprintf("%5.3g",colorAticks))
colorBticks<- if (log_cB) within(log_pretty(rangepropCB, 1),rangepropCB, F, T) else within(pretty(rangepropCB),rangepropCB, F, T)
axis(2, at=colorBticks, labels=sprintf("%5.3g",colorBticks))
}else{
for (o in c(1,palette_l/2, palette_l)) {
plot(NULL, xlim=range(propCA), ylim=range(propCB), log=paste(if(log_cA) 'x' else '',if(log_cB) 'y' else '', sep=''), xaxt='n', yaxt='n', xlab=prop_names[4], ylab=prop_names[5], bty='n', main=sprintf("ColC=%g percent",100*o/palette_l));
for (m in 1:palette_l) {
for (n in 1:palette_l) {
rgbcol<-rgb(colorpalette[n],colorpalette[m],colorpalette[o], alpha_default)
rect(xleft= prop(x=(n-1)/(palette_l),rangepropCA[1],rangepropCA[2],log_cA)
,xright= prop(x=(n)/(palette_l),rangepropCA[1],rangepropCA[2],log_cA)
,ytop= prop(x=(m-1)/(palette_l),rangepropCB[1],rangepropCB[2],log_cB)
,ybottom= prop(x=(m)/(palette_l),rangepropCB[1],rangepropCB[2],log_cB)
,col=rgbcol
,border="transparent"
)
}
}
# flush.console(); Sys.sleep(0.5)
}
colorAticks<- if (log_cA) within(log_pretty(rangepropCA, 1),rangepropCA, F, T) else within(pretty(rangepropCA),rangepropCA, F, T)
axis(1, at=colorAticks, labels=sprintf("%5.3g",colorAticks))
colorBticks<- if (log_cB) within(log_pretty(rangepropCB, 1),rangepropCB, F, T) else within(pretty(rangepropCB),rangepropCB, F, T)
axis(2, at=colorBticks, labels=sprintf("%5.3g",colorBticks))
}
# the bubble legend is a third plot, going in the lower right.
plot(NULL, xlim=widen(c(-1,1), 0.9), ylim=widen(rangepropS, 0.9), log=if(log_s) 'y' else '', xaxt='n', yaxt='n', xlab='', ylab='', bty='n', main=titles$size);
circles <- if (log_s) within(log_pretty(rangepropS, 1), rangepropS, strict=F, wantrng=F) else within(pretty(rangepropS), rangepropS, strict=F, wantrng=F) #all of them
# but I only select five at the most:
circles<-rev(circles[ceiling((1:5)*length(circles)/5)]) # select 5 circles for the legend.
# I may want to get rid of one of the the last two, in case their values are too close
# if (log_s) {if(circles[length(circles)]/circles[length(circles)-1] < 25) {circles<-circles[1:(length(circles)-1)]}}
# else {if((circles[length(circles)]-circles[length(circles)-1]) < 0.7(circles[length(circles)-1]-circles[length(circles)-2])) {circles<-circles[1:(length(circles)-1)]}}
# xlegend<-2
points(rep(-1,length(circles)), circles, pch=21, bg='grey', col='white', cex=prop(invprop(circles, min(circles), max(circles)),min(sizepaltt),max(sizepaltt)))
text(rep(0,length(circles)),circles, labels=sprintf("%5.3g",circles))
}
view raw gistfile1.r hosted with ❤ by GitHub

It's messy and not at all clean - but it gets the job done. This routine is also dependent from several others which define colorscale and other accessory functions... feel free to drop me a line in the comments if you want the lot... Similar plots can be obtained with ggplot2 in much fewer lines, although right now I'm less expert at it so they're much less customised.

9 comments:

  1. Hello Luca,

    That is an incredibly useful and complex bubblechart. Could you provide a sample dataset to use with the bubblechart function?

    Thanks,
    John

    ReplyDelete
  2. Thanks John...

    If I'm not mistaken, most of those graphs were generated using a random set, except for the last one:

    there's a lot of pieces missing from it... I'll try and find back some example code where I define all that's needed and then use the function...

    git://gist.github.com/2712729.git

    (it\'s still missing the .rsave file to load, but you should be able to create a suitable dataset easily - as long as it is called 'Calc', it should work straight out of the box.

    ReplyDelete
  3. thanks real good and useful, Thanks Samuel Bangalore India

    ReplyDelete
  4. Dear Luca,

    great work! Thanks!
    One question: I am somewhat struggling to reproduce the color gradient from the rectangle on the upper right in this plot: http://1.bp.blogspot.com/-12Kqc0jfXr4/T66_COQGGAI/AAAAAAAAlVI/Fbl_cMKwP0A/s1600/Bubbles_in_2_Colors.png.
    Your above code is quite complex and I was not really able to figure out the relevant parts to achieve this. I would like to reproduce this very gradient only. Could you do me a favor and post a simple example of how to produce this? (only the 2D gradient)?
    We are currently discussing this plot on stackoverflow: http://stackoverflow.com/questions/11070101/2d-color-gradient-plot-in-r/11070260#comment14489845_11070260

    Thanks in advance!
    Mark

    ReplyDelete
    Replies
    1. Here is a simple solution, by drawing smaller squares a finer gradient can be made. I think the digital nature makes it a little more readable though

      mycol<-(0:10)*10
      mycol2<-mycol

      plot(c(0, 100), c(0, 100), type= "n", xlab = "", ylab = "")
      for(y in 1:10){
      for(x in 1:10){
      rect(mycol[x], mycol[y], mycol[x]+10, mycol[y]+10, col = rgb(mycol[x]/100, 0.75, mycol[y]/100), border = "transparent")
      }
      }

      Delete
  5. This comment has been removed by the author.

    ReplyDelete
  6. Thanks for the contribution!! These kind of plots are really useful and powerful.

    I’m trying to follow your program, but you use the function log_it and I have not been able to find any reference about it. Can you let me know what is doing or tell me where to find it?

    Thanks a lot,

    Nicolás

    ReplyDelete
    Replies
    1. Thanks Nicolas...

      I looked back in my old files but I can't find it...

      From what I remember, the log_it function would just take the logarithm of its input but checking first that it was positive, to avoid errors...

      The prop and invprop function would instead transform data from an A to B range to a to b... I'd essentially use them to rescale values so they would plot within the limits of had chosen for some portions of the graphs...

      I must one day clean up the code but I'd probably rewrite it from scratch in ggplot right now...

      Delete
  7. Also, what about the function invprop? Thanks again.

    Nicolás

    ReplyDelete