Ggplot2 version 2.0.0 released by Hadley Wickham (rstudio.org)
118 points by rz2k on Dec 22, 2015 | hide | past | favorite | 48 comments


> instead, we now use ggproto, a new OO system designed specifically for ggplot2

And this is why I gave up on R for anything other than ad hoc data exploration (don't get me wrong; it's killer for that). When I finally learned there were no fewer than four systems of OO buried in base R, and still people were unhappy and inventing more, I realised that as far as well-structured programming goes, R is kind of beyond help at this point.

From what I can tell, my biggest disappointment with ggplot2 is not addressed: the fact that ggplot evaluates all references not found in the data frame being plotted in the global environment. The result is that you can't make reusable, composable plots in any reasonably sane way. It sends the whole illusion of composability that ggplot()'s '+' syntax suggests right down the tubes. My plots were all more reusable before I ported them to ggplot (i.e., in base graphics).
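The failure mode I mean looks roughly like this (a hypothetical sketch; `make_plot` and `cutoff` are names I'm inventing for illustration):

```r
library(ggplot2)

make_plot <- function(df, cutoff) {
  # 'cutoff' is local to this function. Because aes() expressions are
  # evaluated lazily, affected ggplot2 versions could end up looking the
  # name up in the global environment at print time instead of here,
  # silently picking up an unrelated global 'cutoff' or erroring out.
  ggplot(df, aes(x, y, colour = x > cutoff)) + geom_point()
}

p <- make_plot(data.frame(x = 1:10, y = rnorm(10)), cutoff = 5)
```

Whether `print(p)` then does the right thing depends on where and when the plot object is rendered, which is exactly what kills composability.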

Apologies for the whining, and congrats to Hadley!


My biggest problem with ggplot is that it's slow, unbearably slow. It also forces all data to be realized into a single data frame, which is workable only for small (as in: fits in memory) datasets.

Very often, to produce specialized plots, I have to send data to the canvas in chunks by performing pre-processing myself. ggplot really doesn't work in this scenario. Combined with the general slowness, it forces me to use alternatives quite frequently.

It's a bummer, really, because I'd like my plots to have a consistent visual style, and doing that across different plotting packages is an issue.

I very often resort to gnuplot when it comes to huge datasets and/or incremental plotting. The same is true in Python (matplotlib is also very slow, independently of the backend). But at least with seaborn (https://github.com/mwaskom/seaborn) you can easily mix the convenience of plotting from a DataFrame with supplying raw data arrays.

ggplot is really awesome for what it does, but 1) the syntax doesn't really please me (it feels forced onto the wrong context); 2) it doesn't scale, which forces me to use alternatives too frequently; 3) trying to customize the plot style beyond a few minor tweaks is pure hell.


At the end of the day, the whole point of ggplot is to produce a graphical representation of some aspect of the data. How much information can you possibly cram into ONE graphic and have it be readable by a human? Your problem is really a data reduction problem and not a plotting/graphics problem.


The data that goes into the plot is unrelated to its visual complexity.

The "problem" is that ggplot also takes care of the transformation/reduction step for you.

For example, a KDE plot can consume a practically limitless amount of data while still producing a very simple plot. Likewise for most smoothers.

However, if I have to produce the KDE/smoothed line myself, I lose almost all the advantages of using ggplot (I have to calculate the density manually, and scaling and attaching labels is another PITA).
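Concretely, the difference looks something like this (a minimal sketch using a made-up data frame `df` with a numeric column `x`):

```r
library(ggplot2)
df <- data.frame(x = rnorm(1000))

# Built-in transformation: ggplot computes the KDE, the scales,
# and the axis labels for you
ggplot(df, aes(x)) + geom_density()

# Doing the reduction yourself: compute the density first, then plot
# the result, and handle labels and scaling on your own
d <- density(df$x)
ggplot(data.frame(x = d$x, y = d$y), aes(x, y)) +
  geom_line() +
  labs(y = "density")
```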

On top of that, as others have said, ggplot already struggles with thousands of entries. A simple 5x5 faceted scatterplot with ~10k points might take seconds to render on recent hardware. When I plot data interactively for exploration, I might do this hundreds of times a day; all the convenience is lost in the time wasted on rendering.


This is a very valid point that I feel we often overlook. Most folks don't think like a statistician, and over-complicating figures is the best way to render them useless. All is lost if your audience can't understand what you're trying to convey (:


Which is why ggvis has a different architecture that will make working with large datasets easier. Not to mention the pipe as a unifying interface across ggvis and dplyr, which makes it easier to do efficient data manipulation within the visualisation.


Most of the world's data analysis still happens on data sets of at most a couple of thousand observations. It'd be neat if ggplot2 were faster for large datasets, but I can imagine that's not a priority.


The last time I measured the runtime, it was slow even for medium-sized data sets, in comparison to lattice or R's native plot functions. It's probably not a problem for interactive use, but it becomes annoying when generating reports with many visualizations.


Actually, there is a trick that allows you to store the state of your visualisation as you add layers. You don't need to have all your data in just one dataframe.

http://koaning.io/casino-gambling-simulations-in-r.html


ggplot is slow (though not as slow as it was 5 or so years ago) and Matplotlib is even slower.

I took to the ggplot syntax immediately. What specifically do you find odd/forced about it?


I get the complaint about one more OO system, but I did consider all the options and I'm pretty sure that this was the best solution, given how few people will be exposed to it.

There's also a legitimate learning challenge: if I want to figure out which OO system is best for most projects, I have to try them all out. And then the payoff for translating to a common system is small.

That said, for new projects I pretty much use S3 and R6 exclusively. (And it is legitimate to have two systems because they are so different; R6 is mostly needed for internal mutability.)


And yet R is really a functional programming language, which is why people keep writing OO code that really is garbage. BUT to say I gave up on a language because it has so many options is strange to me.


"BUT to say I gave up on a language because it has so many options is strange to me."

This is a regular concern in programming-language design. Give coders too much freedom and they invent 20 ways to do the same thing, none clearly better than the others. Restrict the freedom and there are more like 3-4 ways to do something, and usually one is better than the rest.

As a new user of R, I must say that the single hardest thing has been learning all the different ways to do something. Each blog/StackOverflow/etc. article I read seems to propose 5 different ways to do each thing. At the end of it, I'm left not really knowing which way works best.
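To give a concrete flavour of the problem: even the single task "mean of mpg per cyl group" on the built-in mtcars data has at least four idiomatic spellings.

```r
# Base R, three ways
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
tapply(mtcars$mpg, mtcars$cyl, mean)
sapply(split(mtcars$mpg, mtcars$cyl), mean)

# dplyr, the "Hadleyverse" way
library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(mpg = mean(mpg))
```

All four give the same numbers, and every tutorial picks a different one.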


There is a weirdness to R. I found that taking a lot of time to focus on the basics helps.

If you're committed to using R, read John Chambers's book and also look for answers on the old R mailing list (a place so mean it makes Stack Overflow look like Mister Rogers' Neighborhood).


Working in Hadley Wickham's universe with dplyr, ggplot2/ggvis, and stringr really helps a ton.


Absolutely, and don't forget tidyr! It's hard to overstate Mr Wickham's contribution to making R a very handy and useful tool. I am sure that a lot of folks would have tried R and then moved on to something else were it not for the "Hadley-verse".


It is the simple Functional language we want when working with data.


If you're new to R, I'd highly recommend learning everything using Hadley Wickham's packages. When I was first getting started I had the same problem, and IMO Hadley's solutions offer the best maintainability. Especially with how quickly R is growing, the time that Hadley has invested in teaching makes me believe that his approach will continue to be one of the guiding influences for the future of the language. Just my two cents though!


And what do you use that even remotely approaches the breadth and depth of tools available in R, but solves this problem?


88% python + 10% calling R from python + 2% porting R code to python.

edit: that being said, ggplot is still better than anything on the python side of things.


I wish I had a good answer to that, but I don't. Python is taking a good shot at it. But it doesn't match R by any means.


seaborn, bokeh, pandas' built in methods, and plotly fill in the ggplot gap for me.


So perhaps we ought to be working on building solutions that bridge the Python-R gap? Because as someone who has effectively left Python, I'd love to learn from the solutions Python users find effective. After all, it's not like one language is the sole proprietor of useful approaches (;


I'm not saying you're wrong, but this sounds exactly like the arguments made supporting Perl, which is an interesting comparison.


It is an interesting comparison for sure, but things are different for two reasons:

1) Software and related services will grow immensely over the next decade because of the "cloud". Some of the growth will be same old, same old application logic software, but most of it will be data analysis software.

2) The language of choice for data analysis software is and will be R, due to the biggest software companies (both #1 and #2, among many others) standing firmly behind it, along with universities incorporating it into their curricula, ditching SPSS and SAS.


Well, both are pretty great languages (not without issues, but what is?), and there are a lot of similarities -- "more than one way to do it" of course, and CRAN and CPAN have much in common (I do wish CRAN took a few lessons from CPAN though).


If you have the time, could you please shoot me an email with the lessons you think CRAN could learn from CPAN? Thanks!


have we talked about CPAN testers?


I'm pretty sure the evaluation bugs have been fixed for a while (subject to some caveats). Unfortunately it's impossible to fix fully in ggplot2 without a massive rewrite, but ggvis doesn't suffer from the same problems. Now I actually know how to do non-standard evaluation correctly :)


I've spent a lot of time working with ggplot2 for making advanced data visualizations. For reference, I wrote a nice tutorial on how ggplot2 can be used to make nice charts easily and effectively: http://minimaxir.com/2015/02/ggplot-tutorial/

To put the scope of the changes in the 2.0.0 release in perspective, I suspect that even the simple tutorial is now broken, let alone the code for my more intricate charts. I'm not upset about it, though, since all the changes, especially the breaking ones, are well reasoned and well documented. Hadley did a great job of explaining everything.

I'll have to spend some time diving into ggproto.


I just wanted to compliment you on the thorough job you did on the ggplot tutorial. In my wanderings, I haven't found anything that comes close, (the great) Dr. Wickham's materials included.


Haha, thanks. I may have to do a follow-up, especially if ggplot2 2.0.0 is as breaking as I thought.


If it does break stuff that you think it shouldn't, please let me know. The goal is to do a patch release in a month or two to fix anything I broke accidentally.


Mostly the geom_bar and geom_histogram split, which I agree was a good idea.

I also used the order parameter on a few visualizations, so I'm unsure how to order a stacked bar chart without it. I'll give it another look and see what I find and file as appropriate.


Thanks! I think I may have underestimated how many people confused bar charts and histograms :/


Well, the only difference seems to be whether the x axis is ordinal or continuous.
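Right, which is the whole distinction the split encodes (a sketch using the built-in mtcars data):

```r
library(ggplot2)

# geom_bar: discrete x, one bar per category, bar height = count
ggplot(mtcars, aes(factor(cyl))) + geom_bar()

# geom_histogram: continuous x, values binned first, bar height = bin count
ggplot(mtcars, aes(mpg)) + geom_histogram(binwidth = 2)
```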


Oh man, where do I even begin?

Error in add_ggplot(e1, e2, e2name) : could not find function "is.coord"

Error in (function (el, elname) : "panel.ontop" is not a valid theme element name...

Error in layer(mapping = structure(list(x = element$Line, y = 0, xend = element$Line, : unused arguments (arrow = NULL, lineend = "butt", na.rm = FALSE, colour = "grey50", linetype = 2)

Error : Unknown parameters: guides

Error in FUN(X[[i]], ...) : attempt to apply non-function

Error in coord_tern() : could not find function "coord"

I just can't keep up with this - most of these errors involve objects I am not even using (such as 'panel.ontop' when using theme(legend.position="bottom")). For those of us who chose to use (made the mistake of using?) ggplot2 in larger projects such as shiny, big updates like this that render all our past code moot present a dilemma: do we keep re-learning this wheel, or just stop updating altogether? The latter seems to be the only feasible course. Every plot implementation I had on my website was broken by this update.

I don't mean to be mean; ggplot2 is a wonderful package, but this update brought nothing but headaches.


Please file bugs on GitHub. It looks like you're using packages built on top of ggplot2 - in particular the maintainer of ggtern seems to have disappeared.

Otherwise, I'd highly recommend using packrat or similar so you can choose, on a project-by-project basis, when you want to upgrade selected packages.
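The basic packrat workflow is just this (a sketch; run from the project directory):

```r
# Turn the current project into a packrat project with its own private
# library, then record the exact package versions currently in use
packrat::init()
packrat::snapshot()

# Later, on another machine or after a botched upgrade,
# restore exactly those recorded versions
packrat::restore()
```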


Extended release notes here[1], as well as an explanation of the new object system for extending it, ggproto, here[2].

[1] https://github.com/hadley/ggplot2/releases/tag/v2.0.0

[2] https://cran.r-project.org/web/packages/ggplot2/vignettes/ex...


> although I’ve tried to minimise breakage as much as possible

The R community really needs to grasp version control head-on. There are some great tools, and using git or similar works just fine, but their central packaging system only knows about the latest version of everything.


This is actually a huge strength for the R community, who by and large are not programmers and just want stuff to work. The fact that the latest version of all CRAN packages actually work together makes life much easier. And for experts there are tools like packrat and checkpoint that give you more control.


I'm sorry, but this mindset is very harmful. If I need an old version of a package and I didn't save it off somewhere first, it simply cannot be found.


Huh? There's an archive link on the CRAN page for every package; e.g. ggplot2 back to 2007: https://cran.r-project.org/src/contrib/Archive/ggplot2/


packrat and especially checkpoint to the rescue!


Checkpoint only lets you specify that all packages be downloaded and used as they existed on CRAN on a particular date. This is limiting in that an R developer may want to use very specific versions of packages that span multiple epochs.

I would really like to see R develop functionality akin to Maven or SBT, so that an R developer can explicitly specify the exact versions of all dependencies, which would then be installed on the first run.
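Until then, the closest thing I know of is devtools, which can pull one exact version out of the CRAN archive at a time (a sketch; the version number is just an example):

```r
# Install a specific archived version of a package from CRAN
devtools::install_version("ggplot2", version = "1.0.1",
                          repos = "https://cran.r-project.org")
```

That still has to be done per package, by hand, rather than from a single declarative dependency spec.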


That's the goal of packrat. (But it's still a work in progress)


As someone who has suffered through a few different smallish projects with R, I don't think it would be even remotely worth using without ggplot2 and Hadley's other amazing contributions, especially dplyr and rvest. Easy to find flaws in R, but ggplot2 makes it easy-ish to produce nice-looking plots.


Thanks!



