Monday, May 14, 2012

Plotting data and distribution simultaneously (with ggplot2)

Ever wanted to see at a glance the distribution of your data across different axes? It happens to me often, and R lets you build a nice composite plot for it. This is my latest concoction. I used ggplot2 here, but equivalent graphics can be made using either base graphics or lattice.

The data set is the usual 'iris'. The central plot has petal length and width along the X/Y axes, and I used a customised color palette to be friendlier to color-blind people. To the left of and above the main plot are the density distributions of the whole set (grey) and of each subspecies.
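For readers who want to try the central plot, here is a minimal sketch, not necessarily the code behind the figure. The palette is the color-blind-friendly one published in the Cookbook for R; the adapted palette used in the post may differ. The names `cb_palette`, `df`, `x`, `y`, `z` and `p_main` are assumptions made for illustration.

```r
library(ggplot2)

data(iris)

# Colour-blind-friendly palette from the "Cookbook for R";
# the post's own adapted palette may differ from this one.
cb_palette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
                "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Hypothetical column names: x/y for the petal measurements, z for the group.
df <- data.frame(x = iris$Petal.Length,
                 y = iris$Petal.Width,
                 z = iris$Species)

# Central scatter plot, one colour per group (the grey entry is skipped
# so that grey stays free for the whole-set density).
p_main <- ggplot(df, aes(x = x, y = y, colour = z)) +
  geom_point() +
  scale_colour_manual(values = cb_palette[2:4], name = "Species") +
  labs(x = "Petal length", y = "Petal width")
p_main
```

Mapping the grouping column to `colour` inside `aes()` is what gives one color per group; the manual scale only decides which colors are used.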

Well, I hope the code is clear - this time I commented it a bit more...
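In case the original listing is not at hand, the following is a rough sketch of one way to get a comparable composition with ggplot2 and grid. It is an approximation under stated assumptions rather than the code used for the figure: the 4x4 layout, the helper `density_plot()`, the column names `x`, `y`, `z` and the three colors are all chosen here for illustration, and the mirrored (specular) densities are left out.

```r
library(ggplot2)
library(grid)

data(iris)
# Hypothetical column names (x, y, z); the post's own data frame may differ.
df <- data.frame(x = iris$Petal.Length, y = iris$Petal.Width, z = iris$Species)
group_cols <- c("#E69F00", "#56B4E9", "#009E73")  # three colour-blind-friendly hues

# Central scatter plot.
p_main <- ggplot(df, aes(x = x, y = y, colour = z)) +
  geom_point() +
  scale_colour_manual(values = group_cols) +
  theme(legend.position = "bottom")

# Density of one variable: whole set in grey, then one curve per group.
density_plot <- function(var) {
  ggplot(df, aes(x = .data[[var]])) +
    geom_density(fill = "grey70", colour = NA) +
    geom_density(aes(fill = z), alpha = 0.4) +
    scale_fill_manual(values = group_cols) +
    theme(legend.position = "none", axis.title = element_blank())
}
p_top  <- density_plot("x")
p_left <- density_plot("y") + coord_flip()  # flipped so the variable runs vertically

# Helper that addresses a cell (or block of cells) in the grid layout.
vplayout <- function(row, col) viewport(layout.pos.row = row, layout.pos.col = col)

grid.newpage()
pushViewport(viewport(layout = grid.layout(4, 4)))
print(p_top,  vp = vplayout(1, 2:4))
print(p_left, vp = vplayout(2:4, 1))
print(p_main, vp = vplayout(2:4, 2:4))
```

The panels are only placed next to each other, so the axes of the density plots will not line up exactly with those of the scatter plot; getting them to match takes extra work on margins and axis limits.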

9 comments:

  1. Nice example. One thing I would nit-pick, though, is that the area under the reflected density estimate for the full distribution is not the same as the area under the estimates for the sub-groups.

    This is difficult to amend, though, as the kernel density estimate for the subgroups will differ from the one for the total distribution. Some experimentation suggests that using counts instead of density, and then using a stacked density estimate for the reflected distribution (which is only possible with counts), constrains the areas to be the same. An example is given below using your same 'df' (and I'm sure more can be done to make it look a little nicer):

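    library(ggplot2)  # 'df' is the data frame from the post (columns x and z are used)
    p5 <- ggplot(df)
    # per-group counts above the axis; alpha is a constant, so it goes outside aes()
    p5 <- p5 + geom_density(aes(x = x, y = ..count.., fill = z), alpha = 0.4)
    # mirrored, stacked counts below the axis, drawn as outlines only
    p5 <- p5 + geom_density(aes(x = x, y = -..count.., col = z), fill = NA, position = "stack")
    p5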

    One might also want the density estimates for x and y to be equal as well, but I'm not sure how to accomplish that.

    One could probably also make a case for scaling the density estimate (which makes comparisons between sub-groups easier), but at first blush I think you would not want to do this.

    Also, in the future I would appreciate it if you made the code completely self-contained (for example by including the data(iris) statement, as well as the code used to make the color palette). Also, in my current version I needed to load the grid library.

    Thank you for the example!
    Andy

  2. Thank you for your comments, Andrew. Very insightful.

    I am aware of the issues of scaling between the 'overall' density and the 'by group' density - I'll try to update the plot and code to reflect your suggestions.

    From now on I'll do my best to make the code totally self-contained, too...

    Replies
    1. Thanks for such a quick update, and keep the good examples coming!

  3. Hello Luca,

    Very interesting graph. Could you provide the code necessary to generate the palette (i.e. cbnbPalette)?

    Thanks,
    John

  4. The code's been added. I took it (and adapted it) from http://wiki.stdout.org/rcookbook/Graphs/Colors%20(ggplot2)/#a-colorblind-friendly-palette

  5. Luca,

    I love the plot! I have two suggestions:

    1) move the legend to the top-right area, vplayout1(1,5) (should be pretty easy, right?)

    2) separate the two density plots (grey and specular) with a little whitespace, because I find it essential to have the straight reference line when looking at density plots (that seems harder).

    What do you think of that?
    -Paul

  6. Thanks Paul,

    I'll try to move the legend, although I may have to create a new empty plot to stick it up there... Until yesterday I didn't know about geom_blank(), which should let me do exactly that...

    As for suggestion number two, it may be a bit tough... In base graphics I'd use 'arrows' to translate the plots up and down... in ggplot I may be able to fudge it... I'll give it a go tomorrow, together with Andrew's suggestions...

    Replies
    1. I've made all the changes that were easy to do...

      I'll think about the dividing line later, Paul...

  7. Dear Sir,

    Thank you for the excellent post.

    What is the code for:

    col=subsp
