Would you mind providing some reasons on how you find dplyr better than Python/pandas?
I'm genuinely curious, because I just started using pandas in a new job a few weeks ago, and it seems robust enough so far. I glanced through your link and didn't see any key differences.
My gripes thus far about pandas are that it can be a bit verbose sometimes, e.g. when doing groupbys.
And I haven't quite grokked the indexing. As in - I never use indices; I always reset indices after groupbys to get the group keys back as columns. And I find multi-indices a hassle. E.g., if I group by a column and want a sum and count, or a max and min, in one go, I get a multi-index that requires using a tuple to access the columns afterwards.
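A small illustration of that gripe (column names are made up for the example): aggregating two ways at once yields MultiIndex columns that need tuple access, which can then be flattened back to plain names.

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})

# Two aggregations at once produce MultiIndex columns...
agg = df.groupby("g").agg({"v": ["sum", "count"]})
print(agg[("v", "sum")])  # ...so each column is addressed by a tuple

# One common workaround: flatten the columns and reset the index
agg.columns = ["_".join(col) for col in agg.columns]
agg = agg.reset_index()
print(agg["v_sum"])
```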
Oh, and one actually annoying one - grouping by a column that contains NaNs silently drops those rows. Not the behaviour I'd expect, and it requires ensuring every groupby is preceded by a fillna, which adds to the verbosity.
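A quick sketch of that behaviour (toy data, made-up column names); note that newer pandas versions also accept `dropna=False` in `groupby` to keep the NaN group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", np.nan, "a"], "v": [1, 2, 3]})

# The row whose key is NaN silently vanishes from the result:
print(df.groupby("key")["v"].sum())  # only group "a" appears (sum 4)

# Workaround: fill the key column first so the row survives
print(df.fillna({"key": "missing"}).groupby("key")["v"].sum())
```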
And just thought of another annoyance. Integer columns are silently turned into floats if a row has any NaNs. So your column of integer IDs, turned into floats, won't join with another table expecting ints (I've had to work around it by converting to strings).
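The upcasting looks like this; newer pandas versions also offer the nullable `Int64` dtype as an alternative workaround to stringifying IDs.

```python
import numpy as np
import pandas as pd

ids = pd.Series([1, 2, 3], dtype="int64")
print(ids.dtype)  # int64

# A single NaN silently upcasts the whole column to float:
with_nan = pd.Series([1, 2, np.nan])
print(with_nan.dtype)  # float64

# Nullable integer dtype keeps ints alongside missing values:
nullable = pd.Series([1, 2, np.nan], dtype="Int64")
print(nullable.dtype)  # Int64
```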
Besides that, pandas seems pretty reasonable. I've found its use of masks to be pretty powerful, for instance.
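For readers unfamiliar with the masks mentioned here, a minimal sketch (toy data, column names invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 31], "city": ["NY", "SF", "NY"]})

# Boolean masks compose with & and | (the parentheses are required):
mask = (df["age"] > 28) & (df["city"] == "NY")
print(df[mask])  # only the rows where the mask is True

# Masks also drive conditional assignment:
df.loc[mask, "flag"] = True
```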
This example seems a little odd, since it takes a data frame and mutates the x column so that every entry equals the sum of the second component of the y column/vector and the third component of the z column/vector.
It is odd since the whole column turns into one value.
It is quite simple in pandas, but as I said in another comment, your example is a bit weird, since it turns all values of x into the same value. But this is how to do it:
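A sketch of what that pandas version presumably looks like (column names x, y, z taken from the discussion; remember R is 1-indexed, so y[2] and z[3] become `.iloc[1]` and `.iloc[2]`):

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3], "z": [4, 5, 6]})

# Equivalent of dplyr's mutate(df, x = y[2] + z[3]):
# every entry of x becomes y's 2nd value plus z's 3rd value.
df["x"] = df["y"].iloc[1] + df["z"].iloc[2]
print(df["x"].tolist())  # [8, 8, 8]
```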
If you are not using indices and multi-indices, you are missing out on the awesome advantages of using pandas. If you come from an R background, indices (like rownames in R) seem a real hassle, and you always want to keep them in columns.
But in pandas they are highly optimized and battle-tested, and a breeze to work with once you get the hang of them. They make merging dataframes easy, pivoting easy, data tidying easy, and so on.
However, since you are still learning the API, they can be a pain to use at first.
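A small sketch of the index advantages being claimed here (invented tables): with a shared index, joins need no key arguments, and pivoting works straight off a multi-index.

```python
import pandas as pd

sales = pd.DataFrame({"id": [1, 2], "amount": [100, 200]}).set_index("id")
names = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}).set_index("id")

# Index-aligned join: no `on=` needed, rows match by index label
joined = sales.join(names)
print(joined.loc[1, "name"])  # label-based lookup via the index

# Pivoting long data to wide via a multi-index and unstack:
long = pd.DataFrame({"id": [1, 1, 2], "k": ["x", "y", "x"], "v": [1, 2, 3]})
wide = long.set_index(["id", "k"])["v"].unstack()
print(wide)
```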