Seeking new test ideas for Framework Benchmarks Round 4+ (github.com/techempower)
38 points by bhauer on April 12, 2013 | hide | past | favorite | 39 comments


As a passive consumer of the benchmarks/info, I've got to say a huge "Thanks" for this! I've been a watcher of the Debian benchmarks game for a long time (since its inception), but, after using Python, Ruby, Haskell, etc., I had written off JVM languages as either slow (e.g. Groovy) or non-expressive (e.g. Java)... This benchmark has got me seriously interested in JVM languages again. (Note: I don't want to use Scala^H^H^H^H^HFrankenstein)

Given the computational ability of browsers, the "front end" is rich and the "back end" is shrinking down to just the API. I'm not sure that the extra verbosity of Java is too large a cost for the performance it offers.

I currently have an application doing 1500 requests a second on Django and it's using 3 servers. I could have used only one server if I'd used Java or Clojure? Hmmm...


We've got more Haskell, some Erlang, Lua, more Scala, a bunch of PHP, and myriad other goodies in Round 3 thanks to a bunch of pull requests. I think Round 3 is going to show a lot more competition at the top tiers.


The "Groovy is slow" argument has sort of run out of steam, in my opinion. Support for invokedynamic and the static compilation annotation provide a lot of speed. I use Groovy daily and my experience is anecdotal, but maybe it is time to do some benchmarks of my own.


Dynamic-mode Groovy is as slow as other dynamic languages. The recently added static-compilation mode for Groovy still regularly spews up bugs, perhaps because it was written by a single programmer with little beta testing whereas Scala and Java have many high-pedigree developers behind them, and heaps of documentation and testing.


> I could have used only one server if I'd used Java or Clojure? Hmmm...

Use Clojure, but keep in mind that if you're not caching aggressively you can saturate persistence-side quickly - making your choice of application language irrelevant.


Sadly that isn't very true anymore. Databases (and I/O — think SSD) really are faster than most of the dynamic languages these days.


That's a pretty nonsense sentiment. How can a database be "faster" than a language?


For example, going to the database and retrieving the data uses less CPU and time than deserializing the data into an object in Ruby. That is the sense I am talking about. The bottleneck becomes the code talking to the database rather than the database itself.


That's...that's not how it works. I was talking about the scalability of the database as that's a brick wall you hit long before anything else. You can always fire up more rails workers, scaling persistence is a PITA.

Are you intentionally misunderstanding me in order to make a strange and irrelevant point?


I'm not making a pure scalability argument. Your solution of scaling the rails tier still suffers from far higher latency thus making choice of language/VM relevant. Rails is very, very slow — even if you cache everything behind it. So yes, you can scale it up, but it is at least 10x more expensive to do so.


Can't wait to see how Go 1.1 does in round 3 :)

For anybody else interested in nonsense like this I have a Clojure template benchmark repository on Github here:

https://github.com/bitemyapp/clojure-template-benchmarks

I should probably update the clabango benchmark, some changes were made not too long ago.


My oh my those are some awesome sunglasses on that bear.

We've got Go 1.1 in Round 3 and it's amazing.


The community has contributed even more pull requests since Round 2 of our web application framework benchmarks. We're planning to start Round 3 tests on Monday 4/15 using a new build of Wrk that allows time-limited tests (rather than request-count limited) so that all frameworks are tested for a uniform amount of time.

In the meantime, this Github issue request is seeking thoughts you may have concerning additional simple tests that we can introduce in Round 4 and beyond. We want to define tests that continue to exercise typical web application functionality but remain fairly simple to implement on an ever-widening field of frameworks.

If you have thoughts, please add them here or at Github. Thanks!


Thoughts of incremental web app functionality to test:

1. Exercising a randomized mix of reading and writing. I think you already said you were planning a CRUD test. Consider a tunable ratio here, something like 10000 R to 100 U to 10 C to 1 D.

2. Exercising synchronous web service (JSONP) calls in two modes: (a) to some web service that is consistently fast and low latency, say, the initial JSON example from this test suite running in servlet mode, and (b) to a web service written in the same framework as the one being tested, again using the initial JSON example.

(The idea here is that many frameworks fall on their faces when confronted with latency. This is why synthetic tests are usually so poorly predictive of real world behavior -- people forget that latency causes backlogs and backlogs cause all parts of the stack to misbehave in interesting ways.)

3. Test async ability if the framework has it, with a system call (sleep?) that takes a randomized 0 - 60 seconds to return. Would help understand when a framework is likely to blow up calling out to a credit card processor, doing server side image processing, etc.

4. Exercising authentication (standardize on bcrypt, but only create passwords on 1 in 10K requests), authorization, and session state, if offered.

5. Exercising any built-in support for caching, where 1 in rand(X) requests invalidates the DB query cache, 1 in rand(X) requests invalidates the WS call cache, 1 in rand(X) requests invalidates the long term async system call cache, and 1 in rand(Y) requests blows away the whole cache.
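The tunable ratio in item 1 could be sketched as a weighted random draw. This is just an illustrative sketch, not part of the actual benchmark suite; the operation names and weights are assumptions taken from the 10000:100:10:1 suggestion above:

```python
import random

# Hypothetical sketch of item 1's tunable CRUD mix.
# Weights follow the suggested R:U:C:D ratio of 10000:100:10:1.
OPERATIONS = ["read", "update", "create", "delete"]
WEIGHTS = [10000, 100, 10, 1]

def next_operation(rng=random):
    """Pick the next CRUD operation according to the tunable ratio."""
    return rng.choices(OPERATIONS, weights=WEIGHTS, k=1)[0]

# Rough sanity check: over many draws, reads should dominate updates,
# and updates should dominate creates.
counts = {op: 0 for op in OPERATIONS}
for _ in range(100_000):
    counts[next_operation()] += 1
print(counts["read"] > counts["update"] > counts["create"])
```

A load generator built this way lets the ratio be tuned per run without changing the per-framework test code.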

For the enterprise legacy integrators, it would also be interesting to test XML as well (in particular, SOAP), anywhere we're testing JSON.


This is great input, Terretta. Exactly the kind of thinking I wanted to tap into.

Some quick thoughts in response:

For #1, the conceptual test included reading and writing in a 1:1 R:W ratio (well, to be more accurate 1:1 R:U). I like the idea of extending this a little bit to include C and D. For the sake of benchmark run-time, I'm looking to restrain the growth of permutations. But on the other hand, I like your idea of a tunable ratio. Something to think about!

I like #2 and #3 as well. I'll think about those some more too.

I really like the idea of incorporating some bcrypt and session state (#4).

We have a few caching tests in mind, but like elsewhere, we'll start out simple and then add complexity.

Thanks for the great input! This planning is to have some good long-term ideas in mind.


I've got to do a fork/pull before then in my infinite spare time, but a quick question: why only ORM? Excuse my ignorance, but there are almost no ORMs that I've used which are not materially slower than straight SQL or JSON, the exception being Salat on MongoDB. I wonder what not having an ORM would do to some of the poor performance experienced.


We expect that most modern webapps are (still) developed with an ORM, but we're not ORM hard-liners. A few of the tests are run without an ORM and are identified with the suffix "raw."


What about tracking memory usage? Peak, average, etc.

Previously you tested EC2 vs Local HW. What about adding local KVM virtual machine as well?

I also think a graph showing latency as a function of concurrency would be very interesting.


Thanks for the ideas, voidlogic. We do want to capture server statistics and have that as an issue in Github [1]. I am particularly interested in capturing CPU and I/O utilization because in spot checks, we've observed some frameworks do not fully saturate the 8 HT cores on our i7 hardware, suggesting lock or resource contention.

As for a variety of other hardware and VM environments, the data would be interesting. Related: we plan to migrate the charts and tables to a stand-alone page. Right now, the blog entries are hard-coded to fetch two specific results.json files for rendering the charts/tables. But when we build a stand-alone page, I would like to enhance the script to allow selection of one or two results.json files from a menu for comparing side-by-side. And to your point, the community could then contribute their own results files. Imagine being able to compare EC2 large vs xlarge or vs Xeon E5s or ...

Right now, as you noticed, the latency is only displayed at 256 concurrency. I'll make a note to myself to include a chart for latency versus concurrency when we move to a stand-alone page [2].

[1] https://github.com/TechEmpower/FrameworkBenchmarks/issues/10...

[2] https://github.com/TechEmpower/FrameworkBenchmarks/issues/14...


@bhauer - this may be the best marketing I have ever seen a company do in the tech space (and by marketing I mean marketing to developers). This is a recruiting goldmine.

Sheer genius.


I would like to see a simple CRUD blog built with different frameworks. Building a blog is like the "Hello world!" for dynamic web development.


Given the scope and breadth involved in these benchmarks, that's a helluva tall order. I'm sure nothing's stopping anybody from doing it themselves though.


I think that this could be specced out in stages, and implemented in a number of rounds. First would be a schema for a blog, with authors, posts and comments. Next would be a REST API for posts and comments. Finally, mock pages to be used for posting, reading and commenting in HTML to test the included templating system, if there is one.

You really need to go at least this far. This will also give you an approximate code size for this sample project as well, which is at least as important as performance to some people.


> You really need to go at least this far.

"The framework don't care."

What I mean is, the framework doesn't know if you're building the result of 20 queries into a blog post page that pulled in related data from the post itself, the author profile, and the comments and commenter profiles, or if you're pulling in arbitrary data. So there's no reason to test a "blog". Most of us aren't building blogs. But we are interested in querying databases, calling web services, cached performance, and async process queue handling.


Except that I, and I'm sure many other people, are interested in more than just performance. I want to know how much code it is to achieve some small subset of usefulness, and what it looks like. Is it overly complex? Is it split apart in a paradigm that doesn't match my mental model very well?

I agree most of us aren't building blogs (I'm not), but I believe a blog is a reasonable stand in for a more complex application. It obviously won't test everything, but the requirements are well understood (or can be well understood, if defined well enough).

Also, who's to say that some of these frameworks aren't going to perform significantly worse when they start having to do more than simply serialize data as JSON across a socket? With that in mind, how accurate are some of these benchmarks if they aren't set up and used how they would be in real life?


>You really need to go at least this far.

Oh please. Nobody's ever done such a comparison across more than 2 or 3 languages or frameworks. Have you even seen the benchmarks in question?

Why not just go do it, if it's so essential and straightforward?


> Oh please. Nobody's ever done such a comparison across more than 2 or 3 languages or frameworks. Have you even seen the benchmarks in question?

Okay, let me clarify. When I say "you really need to go this far" I mean that stopping at any point before that (but after the simple metric of how many requests a second it can serve which they already do) makes no sense, IMHO. If you are going to compare frameworks and you want to go beyond that initial performance metric, you might as well aim high enough to be useful.

I agree you never see anything approaching that in other reviews/benchmarks. Is that a good reason to not try it here?

> Why not just go do it, if it's so essential and straightforward?

I'm envisioning this as a community process, not a "Go off and write this in 20 frameworks or you're useless" sort of ultimatum. As such, just speccing out a possible route is helping.

Also, I plan to help with the existing benchmarks. After the second round, I pointed them out to the author of my favorite framework in the hopes he would have time to put together something for the benchmark, otherwise I was going to in the next week or two when I had time. I still plan to.

Oh, and that framework author's answer? That these benchmarks are laughable because all they measure is performance, and there's a clear performance to convenience trade-off shown in the results, and that of course there's a performance hit when the framework handles most the work for you. I have to say I agree. Sure, there's possibly some that are clear winners giving good performance with lots of conveniences for common operations, but is there any way to tell as much from the data presented so far?


> I mean that stopping at any point before that (but after the simple metric of how many requests a second it can serve which they already do) makes no sense, IMHO. If you are going to compare frameworks and you want to go beyond that initial performance metric, you might as well aim high enough to be useful.

The "query 20 random rows, build an object, and return that" test is way beyond requests per second. Enough that the list is dramatically re-ordered.

> That these benchmarks are laughable because all they measure is performance, and there's a clear performance to convenience trade-off shown in the results, and that of course there's a performance hit when the framework handles most the work for you. I have to say I agree.

On the contrary, performance versus magic is absolutely not a "given" here. Yes, some bare languages are near the top, but there are also heavy frameworks (e.g. Spring) performing well, and lean frameworks performing poorly.

As for being "laughable", that makes me suspicious of the framework author's understanding of where optimization needs to happen. Presumably, pages will be run more often than they are authored. Presumably there's a recurring bill for the server farm. Optimizing for performance helps end users stay happy, and helps the company stay in business able to continue employing developers.

> possibly some that are clear winners giving good performance with lots of conveniences for common operations, but is there any way to tell as much from the data presented so far?

I agree with that point. I would like to see 3 additional columns added to the results: LOC, number of src files/templates across number of directories, and number of libs.

This helps suss out your point: how much does a coder have to type in this framework, and how much incidental complexity (files, libs) do they have to wrap their heads around?

Consider:

    [300 loc; 15 files  2 dirs;  8 libs]
    [650 loc;  1 file   1 dir;   0 libs]
    [150 loc; 15 files 15 dirs; 23 libs]
Multiply by 10 to imagine real world code, and these would each feel very different to an author, and to a new hire hired to maintain a project already in production.


> As for being "laughable", that makes me suspicious of the framework author's understanding of where optimization needs to happen. Presumably, pages will be run more often than they are authored. Presumably there's a recurring bill for the server farm. Optimizing for performance helps end users stay happy, and helps the company stay in business able to continue employing developers.

Well, where optimization needs to happen is highly dependent on the business, and where that business is within its lifecycle. Sometimes a higher server farm bill is preferable to some impediment to the developers, because the developers don't scale as quickly. Personally I would much rather throw twice as much hardware at something than work twice as long (at least initially), but obviously it's not as cut and dried as that.

> The query 20 random rows, build an object, and return that, is way beyond requests per second. Enough that the list is dramatically re-ordered.

> This helps suss out your point: how much does a coder have to type in this framework, and how much incidental complexity (files, libs) do they have to wrap their heads around?

The reason I suggested a blog is to also exercise whatever templating system the framework ships with, if any. Otherwise, the standard templating system for the target language.

True, this also could be tested just by making something up and displaying those random 20 rows in some manner, but at this point, why not just use a simple spec for a blog? I think it's almost the same amount of work, and IMO will result in more consistent implementations.

That said, I think we are on the same page, more or less.

Edit: s/work twice as hard/work twice as long/ because it maps more closely to what I meant to express.


To your last point, Terretta, my colleagues and I have had lengthy (but unfortunately inconclusive) discussions about how to represent the efficiency dimension that we expect many readers would like to visualize.

The easier dimension--performance--became our first goal and the source code in Github is expected to very loosely fill the role of answering questions of efficiency. But we know that's a barely serviceable solution to the challenge, and that is especially true as pull requests increase the number of frameworks.

The challenge of representing efficiency succinctly remains.

We have considered lines of code and I think that for all its weaknesses, that is the best proxy for efficiency that I am aware of. Nevertheless, I'd like to contend with some of the following specific issues:

1. Many frameworks create boilerplate when you create a new application. Do we count those LOC?

2. Many frameworks on dynamic languages copy their entire corpus of functionality as source files into the application's root. Certainly we don't count those. Check out our Github repo's colored language bar at the top right. :)

3. Do build scripts count as LOC?

4. Do configuration files count as LOC?

Ultimately, I retain (irrational?) fear of contributing to enshrining LOC as a metric because unlike performance--where higher is always definitively better--lower LOC is not always definitively better.

I was asked separately if we had any data about how long it took us to build tests in the various frameworks. We didn't take detailed logs, but we do have rough numbers. Nevertheless, I've been reluctant to share how long it took us to implement the test code because it's a biased sample. Our previous experiences make us well disposed to some platforms and languages while we make silly time-consuming mistakes on others.


This is how I would answer these questions:

> 1. Many frameworks create boilerplate when you create a new application. Do we count those LOC?

No, as long as the lines did not need to be changed. Any line you must alter (not just add code before/after) should be considered a line of code you had to write to get to a functioning implementation. You had to understand the line before altering it, so you could have written it entirely yourself.

As such, diffing against a reference boilerplate file is a good indication of lines required.
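The diffing idea above could be sketched with Python's stdlib difflib. The boilerplate and implementation snippets here are made up purely for illustration; a real counter would walk the whole project tree:

```python
import difflib

# Hypothetical boilerplate emitted by a framework's project generator.
boilerplate = [
    "def handler(request):",
    "    pass",
]

# Hypothetical finished implementation of the same file.
implementation = [
    "def handler(request):",
    "    rows = db.query(20)",
    "    return json(rows)",
]

def effective_loc(before, after):
    """Count lines the implementer added or altered relative to boilerplate."""
    diff = difflib.ndiff(before, after)
    return sum(1 for line in diff if line.startswith("+ "))

print(effective_loc(boilerplate, implementation))  # → 2
```

Counting only the "+" side of the diff matches the rule above: an altered line shows up as a removal plus an addition, so it is counted once as a line the author had to write.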

> 2. Many frameworks on dynamic languages copy their entire corpus of functionality as source files into the application's root. Certainly we don't count those. Check out our Github repo's colored language bar at the top right. :)

Take whatever you are given as a boilerplate starting point for a new project, and diff the final implementation tree against that. Special care may need to be given to implementations that default to installing third-party libs into the local framework lib path (if any exist), so those are not counted.

> 3. Do build scripts count as LOC?

No, I would think not.

> 4. Do configuration files count as LOC?

This is a bit more complicated, but for simplicity's sake it may be easiest to treat it just like any other reference file as in #1 and #2.

> Ultimately, I retain (irrational?) fear of contributing to enshrining LOC as a metric because unlike performance--where higher is always definitively better--lower LOC is not always definitively better.

I agree, but without something better, it's what we have.

I do think there are ways we could get to a more useful metric using relative lines of code required to implement something between languages, and then compare relative LOC to the host language, but that's way outside the scope of this benchmark, and requires information we don't have (to my satisfaction) yet.


Thanks for the thoughts here, kbenson.

I like the idea of using a diff to judge the total LOC. That reminds me that before we did the initial commit to Github, we had briefly entertained doing more or less that: we were going to commit the initial boilerplate for each framework as commit #1 and then the resulting tests as commit #2. To save on effort, we ultimately did not do that, but it is possible for us to go back to our original local Git repo to glean that information.

Using that approach, I think even build scripts could be included--that is, if you need to modify the build script and it shows up in a diff, then that line counts.


1. Many frameworks create boilerplate when you create a new application. Do we count those LOC?

Yes if it's code someone could edit. Someone unfamiliar with the framework needs to read these lines of code. Boilerplate may be "pre-typed" for you, but it's part of "your" app, and it's optional. That's not the same as a lib or the framework itself. You could start your app differently.

The new hire has to deal with these LOC regardless of who typed them (you or a wizard), so they're significant.

2. Many frameworks on dynamic languages copy their entire corpus of functionality as source files into the application's root. Certainly we don't count those. Check out our Github repo's colored language bar at the top right.

Agree this should not be counted. Only what's part of the app in question. Put another way, if those directories could be "hidden" from the new hire, and he could do his job, don't count them.

3. Do build scripts count as LOC?

No.

4. Do configuration files count as LOC?

Yes. The Spring framework's pom.xml or any dependency injection "configuration" file is crucial, so yes. On configs, though, I'd guess kbenson's idea about diff is reasonable.

- - -

However, about diff on #1 above, I'm concerned that with diff you're really just measuring how different the test is from the boilerplate. That says more about the test than about the framework. Someone could also optimize their framework's pre-typed boilerplate to match your test. (As we see on browser rendering benchmarks, for example.)

Really, if it's code that's typed (by the framework or by you) into an application, it should be counted.

I'm coming to this from the standpoint of hiring a new developer onto a project. More time is spent maintaining projects than jumpstarting them. At some point really soon after jumpstarting the app, boilerplate and edited code are going to be intermingled and indistinguishable. The new guy has to wrap his head around it all. So I would like any code typed by you or for you to be counted.

And beyond LOC, it matters how those lines are distributed. If you have to open five nested files to find out what one hello world is doing, even if each file only has a single 3 line function in it, that's complexity that matters. Number of files and number of directories both make a framework feel very different.

Most benchmarks are for weekend MVPers. Your benchmarks are moving into "real world" territory. In the real world, being able to wrap your head around someone else's existing project matters. LOC, files, directories, and dependencies, factor into that heavily.


> The new hire has to deal with these LOC regardless of who typed them (you or a wizard), so they're significant.

That's true. Maybe two numbers, one with and one without boilerplate? That way someone could get an idea of what they are looking at. (I know, I'm just stacking more and more work up...)


Do we have any chance to look at round 3 results?


Of course! We are starting the tests next Monday and you can expect them to be complete some time around the middle of next week.


ASP.Net MVC?


Sadly, we still haven't fit this in and no pull requests to date. A Mono pull request would cause heartfelt cheers.


+1 for .NET, both Mono and Microsoft versions. It would be great to compare ASP.NET MVC against the most popular web frameworks out in the open source world. If it's a matter of licenses, then perhaps you could apply for BizSpark and get Windows licenses for dev purposes for free?



