The dark side of GraphQL: performance (twitter.com/benawad)
199 points by kamranahmedse on Jan 1, 2020 | 78 comments


The title is misleading. The post doesn't discover any dark sides of GraphQL. The post is about a potential performance problem with a library that implements the GraphQL spec. There might be a problem with the library itself. There might be a problem with the use of said library. The author states that it takes 19ms to fetch 20 recipes from a postgres database. This looks really suspicious. Why does it take so long to fetch 20 indexed rows? Maybe there's some general performance problem with the application?


You focus on the database call, assuming the rows are indexed and that a ~20ms average is "suspicious", but you're not concerned about the 400ms flame chart for graphql-js doing validation shown in the thread?

graphql-js is the reference implementation of GraphQL, so it's not any random library.


The graphql-js library focuses on correctness, not on performance. Facebook doesn't invoke it at runtime, only at build time; they use persisted queries only. If you want a high-performance server runtime I wouldn't use Node.js, especially for a complicated task like validating and resolving a GraphQL query. Node.js is the wrong tool: it's too high level to tweak hot paths and optimize the garbage collector. So no, I don't think the flame graph is suspicious. In my language of choice (Go) I could drill down into memory and CPU consumption for each line of code to find the bottleneck. Maybe this is possible for Node.js too; I don't know the tooling so well. I would suggest that if such a tool exists, a detailed flame graph of the Node.js application might help understand the issue.


OP is using Apollo Server, which is by far the most common server implementation for GraphQL. It may well be there are issues specific to Apollo, but it's definitely worth getting to the bottom of based on how widely used Apollo is.

There is nothing in the posts that identifies NodeJS as the culprit, and based on the info I'd be very surprised if it was. It seems most likely that the type validation is what is taking so much time. But then again, strong types are one of the main benefits of GraphQL. If anything, I've found Node to be one of the easiest and most "natural" server languages for GraphQL, and I have implemented GraphQL servers in Node, Java and Python.


Have you tried Elixir's Absinthe?


I'm curious to find any performance comparisons between Elixir's Absinthe and Hasura, on a Phoenix app.


I would be extremely surprised (but am prepared to be) if Phoenix + Hasura were faster in terms of latency than Phoenix + Absinthe, since one has to enter and exit two VMs and the other doesn't, unless you're suggesting the frontend issues GraphQL queries directly to the Hasura backend, bypassing Phoenix.


Two ways one could test:

1] REST client <-> Hasura <-> Phoenix

2] Phoenix-generated HTML <-> Hasura <-> Phoenix


The last comment in the thread specifically shows that one big part of the degradation depends on the usage of promises. This is either a platform or library problem, but surely not a GraphQL problem at all.


Regarding profiling in Node.js (in case anyone interested is unaware): if you start your application with the "--inspect" argument and then open devtools in Chrome/Chromium, there's a little Node icon that shows up in the top left corner.

If you click on that you can get performance flame graphs / tables, memory profiling, and a REPL for the process. There's also a list of loaded source files, so you can set breakpoints through there if you like, as well as modify the files on the fly if you need something more for debugging.

It can be very useful, and works pretty much the same as the normal web devtools.
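For reference, a minimal invocation (server.js is a placeholder for your entry point):

  # start the process with the inspector listening (port 9229 by default)
  node --inspect server.js

  # or pause before the first line of user code runs
  node --inspect-brk server.js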


There is also ndb for a very similar effect. What I really like about ndb is that it works in front of just about anything, for example `ndb yarn test`

https://github.com/GoogleChromeLabs/ndb


This comment neglects the fact that people can and do use Node in situations where performance is important, and 400ms is especially egregious. Having the obvious path for using GraphQL in JS server-side perform terribly is problematic.


For a point of comparison, when I was using graphql-js around 3 years ago, I was benchmarking things pretty carefully and the main bottleneck wasn't graphql-js -- I had comparable (or faster) response times to equivalent existing hand-crafted JSON endpoints.

But if you're fetching a lot more data than you need for a typical UI, you might run into bottlenecks.


Yeah, it sounds like the poster is having issues with their deep hierarchy causing lots of unnecessary object creation?

I'm still surprised it had that much overhead though, as I generally think of node as being relatively competitive with Java outside of heavy computation.


Deep queries aren't usually a problem, as it's usually just waiting on I/O; it's wide queries that are a bigger concern. I just remember that CPU load was very low compared to other metrics, so I mostly focused on fixing memory usage (which was related to the file descriptor issue).

One issue I see a lot of (which happens to be part of the problem in this case) is people overfetching data on their list UI (e.g. search results) so that the detail page data is already in the client side store. But this is usually a premature optimisation and ends up causing more problems than it solves.


iirc FB doesn't even use graphql-js much? The whole thing was embedded in PHP Ents. graphql-js was written purely for the open-sourcing. Of course things may have changed somewhat in recent years, but I doubt it.


Is PHP Ents a type of webserver FB uses?


Ent is a (proprietary) database wrapper library resembling an ORM,* but I don’t believe Facebook’s GraphQL server infrastructure is tied to Ent at all.

* Though Ent’s creators emphatically claim it is not an ORM, that phrase does give you a rough but directionally-accurate idea of what it is.


ah ok. sorry for potential misinformation, i glean only what i can from being a casual outside observer.


No I heard they still chew a lot of bubblegum over there.


"Reference implementation" almost exclusively refers to behaviour, not performance.

For example, reference implementations of JSRs have very, very rarely been usable/used in production.


> graphql-js is the reference implementation of GraphQL, so it's not any random library.

I always expect the reference library to be optimized for correctness, formal provability, brevity, simplicity and clarity, not for speed (of compilation or execution).


I hear your point and it's true that you can't point the finger at GraphQL qua GraphQL, but that's sort of by definition since gql in itself is, as you say, just a spec. But, and I'm not saying this is the case, if every implementation of gql has major issues, it's not really fair to say "but but the spec itself is fine". As programmers, we aren't working with the spec.


It's technically possible to achieve the result OP wants in sub-millisecond time. However, I'm not sure Node.js is the right stack for this. You'd have to carefully avoid object creation and reduce garbage collection overhead.


Reading the thread, this isn't a "dark side of GraphQL" but a "dark side of not understanding how to debug/improve performance in my software dependency".


Not sure why you are getting downvoted. The person actually states that they don't know how to debug: "honestly, I'm not 100% sure the best way to debug from here." They are just looking at Datadog stats and not finding the root cause. They could do some basic JS debugging of the open source library to figure out the issue. Blaming Apollo would be a stretch (it may not even be the issue, since they haven't done any debugging), but blaming the GraphQL protocol itself goes way too far.


> Not sure why you are getting downvoted.

I don't know why anyone downvotes as they do, but the previous post is an irrelevant argument about semantics, so in my opinion it deserves to be downvoted.

Actually, now that I think of it, it's a little worse than that. The OP is being criticized for not understanding how to debug or improve the performance of their dependency while actively engaged in figuring out how to debug and improve the performance of their dependency. (People respond with questions, OP provides substantive answers, there's a back-and-forth and OP forms an idea it's related to a deeply nested schema, and so on.)


> The OP is being criticized for not understanding how to debug or improve the performance of their dependency…

Nothing I said is a critique of Ben Awad, the Twitter OP. I assume we've all used dependencies that we don't completely understand, no?

To rephrase my point: The dark side of semi-opaque dependencies like ORMs, application frameworks, etc. is that when the magic doesn't work, one may not be able to easily determine where to start in order to address issues.

EDIT: Removed unnecessary paragraph to simplify response.


The person who wrote the “creative” title here on HN for the Twitter url they submitted is not the same person having the issue with performance and posting for help on Twitter.


Ben Awad is pretty knowledgeable in GraphQL. He is one of the most solid YouTube tutorial guys on it.


Having youtube tutorials on something does not make you an expert on the subject. He appears to not have the skills to do basic JS debugging. It's great to ask for help, everyone needs help at some point. The issue I see is starting with "The dark side of GraphQL". If you haven't found the actual issue, how do you even know what the cause is? Just because your SQL query is fast doesn't mean there is some inherent problem in GraphQL or Apollo. That argument doesn't follow. It could be user error.


I don't mean to sound like a blind witness fanboy, but I think teaching something effectively does require some mastery of the subject.

I also think it is hyperbolic to say he doesn't have basic debugging skills. What qualifies as basic debugging skills? Like he isn't capable of using a debugger and introspecting code? He can't use a print statement and look at code? Debugging an E2E bottleneck is not trivial.


I have no doubt. The unfortunate bit is that his tweet uses "GraphQL" to refer to a specific implementation (Apollo Server) of GraphQL rather than GraphQL itself.


But Apollo Server is by far the most common implementation of GraphQL servers, and the OP's thesis based on the twitter thread is that type-checking and validation are responsible for the slowdown, and type-checking and validation are inherent to GraphQL.


It would be like posting "The dark side of SQL" for a slow MySQL query


But there are dark sides to using SQL, often from the abstraction that SQL provides.

Maybe the optimizer picks a poor plan and you can't figure out how to make it work better. Maybe the schema has redundancy you can't change, or the indexes aren't suitable for that query. Maybe it's auto-parameterizing constants and the query with the problem has a parameter causing different behavior than the original constant used in optimizing the query. Or maybe your query with 1000 elements in an IN list worked great in MemSQL or whatever but is unexpectedly slow in the database you ported your app to. There are downsides to everything.


I'm not sure if this is mentioned in the thread, but one of the reasons it takes so long for the requests to return is that GQL initializes the entire record in memory and then reduces it back to only the fields you wanted. This can be a big problem if you have a deeply nested data model and potentially many results; the memory consumption can hit the roof. I find that the best approach in those cases is to create a one-off REST endpoint (or to create a field higher up the GQL hierarchy) and hand-roll the SQL query.
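A rough sketch of what I mean, assuming Apollo-style resolvers and a node-postgres client on the context (table and column names are invented):

  // instead of resolving company -> listings -> projects level by level
  // (instantiating every intermediate record), one resolver higher up
  // runs a single hand-rolled join and returns only what the UI needs
  const resolvers = {
    Query: {
      recipeSummaries: async (_root, _args, { db }) => {
        const { rows } = await db.query(
          `SELECT r.id, r.name, a.name AS author_name
             FROM recipes r
             JOIN authors a ON a.id = r.author_id
            LIMIT 20`
        );
        return rows;
      },
    },
  };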


> when GQL initializes the entire record in memory

GQL is an idea not an implementation. I don't believe there's anything preventing actual software from optimising this case. Or am I missing something here? The query defines what you're asking for so extra data does not necessarily need to be fetched.


You're correct. It's common for GraphQL API implementations to batch all the parent record's fields up front, but it's not the only way. One alternative method is to traverse the whole query object and generate one big query to your database (SQL, graph database, what have you) instead of batching queries per table/object. This has trade-offs. Sometimes it's more performant, especially for smaller queries, but for larger queries it can actually be slower because joining lots of tables into one query causes some duplicated data and transfer overhead (assuming you're using SQL). I have a feeling that this method would perform very well if your GraphQL data was backed by an actual graph database though.


I'm positive there's a way to only load the relevant data in your stack.

In Elixir with Absinthe we can resolve to the specific fields we need and we don't load the entire records then slim down.


I never used Absinthe, but if you're initializing an ORM in your resolver, loading the entire record into memory is unavoidable. How does Absinthe get around that? (Sounds like it generates the SQL?)


Absinthe does some dumb things by default too. You need to manually optimize your resolvers.

One annoying thing Absinthe did by default, I noticed, was fetching the entire object from the DB even though the GraphQL query only returns its ID. For example, the query below would fetch each person from the DB, even though we already had a list of person IDs at the friends level:

  {
    friends {
      person {
        id
      }
    }
  }


You point it to a resolver, and in your resolver you resolve it; in other words, you write the query yourself.


I thought it could have been a memory problem too, but the VPS didn't show any signs of anything spiking https://twitter.com/benawad/status/1212404379371917313

but I do think it's related to my nested object https://twitter.com/benawad/status/1212407236284338176


Things have matured quite a bit. With Apollo Server it's possible to fully understand which fields are being requested before creating and running, for example, an SQL query. Fetching only the requested data for a given query reduces the in-memory footprint. Most people get the whole data object and then let GQL select the subset of fields the user asked for, but for cases where performance is a problem there is another solution.


I haven't used Apollo Server lately. But the way you describe it doesn't address the core issue, which is the initialization of the intermediate objects in-memory. So just to give an example, if I wanted to query for the projects of listings of my company, I can write it this way in GQL:

   me { company { listings { projects { id name } } } }
This will initialize a User, a Company, all Listings, and the Projects of all listings.

I can also write this in SQL using a couple of joins and return an array. The memory consumption is trivial in comparison to the original request.


You can implement your GraphQL server to do either of those, it's not inherent.


You say "problem there is another solution" - what is the other solution? (I'm guessing it involves somehow telling Apollo Server which fields/related objects you will need?)


How deeply do you mean "entire record"? I am pretty sure that this concern only applies to those fields which share the same resolver.


It's also possible to look at what fields were requested in the GraphQL query and use them to aid what gets fetched.
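A minimal sketch of that idea in a graphql-js/Apollo-style resolver (assuming a pg-style db client on the context; it naively ignores fragments, aliases, and nested selections):

  const resolvers = {
    Query: {
      recipes: async (_root, _args, { db }, info) => {
        // top-level selections requested for this field; these names
        // have already been validated against the schema
        const columns = info.fieldNodes[0].selectionSet.selections
          .map((sel) => sel.name.value);
        const { rows } = await db.query(
          `SELECT ${columns.join(', ')} FROM recipes LIMIT 20`
        );
        return rows;
      },
    },
  };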


I guess I still don't quite get most GraphQL designs or why a lot of people jump to implement it... I have always thought the big idea behind GraphQL is you already have endpoints (likely rest) which are cached/optimized? And then GraphQL becomes a layer over the top to map/reduce client requests for more optimized request/response cycles for clients (and I guess decouple some business logic)?

Which then makes me wonder why this is the "dark side" of GraphQL? Isn't this just not optimizing a query somehow or using a cache effectively? Is it really the nature of GraphQL that's causing this to be slow or just programmer error [1]?

I've used GraphQL in production services as an alternative to a REST endpoint (which I didn't care for), and I don't think the sheer nature of the validations ever caused that much slowdown. Or, more plainly put: GraphQL's design would not necessarily cause such poor performance on such a small set of data.

/shrug I dunno, if this were me, I'd just assume I had written a bad query or validation somewhere. And to be fair to the author, they only made a post on twitter to reflect on their problem, not to say GQL has a dark side (at least from what I read in the thread).

[1] We all have made programmer errors, are likely making some "now," and will for sure make more in the future. No reason to feel bad about it, we're all human :) Mistakes are just a part of life no matter how "good" at things we are.


From the thread.

'Slow response times for large documents'

https://github.com/graphql/graphql-js/issues/723#issuecommen...

It seems that the graphql library is performing a lot of validation, and that's slowing things down. I expect validation to be a pure-compute task, and this is JavaScript, so I suspect this is really a "working with large amounts of data in JavaScript is slow" issue - but that's just at a glance.


Sounds to me like an issue that comes with coupling validation with serialization. A lot of these API frameworks combine the two, with the goal of automating validation when receiving data from clients, but then also do that validation when serializing response data, which should already be valid if it's sitting in your database.

I've run into similar issues with FastAPI and DRF when dealing with really large payloads.


Quite strange; the GQL server source code is literally just walking over fields and resolving promises, very simple and straightforward.

We had something like this in our backend, and such long times usually meant that something was hogging the event loop and blocking everything else from executing.

It could be anything. For example, it could be async hooks, which can make things ~1000x slower if you are using a lot of promises (resolving fields often just produces promises), since the overhead is per promise. In general, in recent Node.js you can create a huge number of promises and they have little to no overhead. So, again: something is wrong with the Node.js setup, some library is flooding the event loop, or something is off deeper in the Node.js internals. It is not an issue with GQL itself, since if you have GQL performance issues it means your server is super slow at processing just about anything. Our team was shocked by the performance, and it turned out that Node.js is super fast and it's some libraries (like Sequelize) that kill the performance; GQL is not one of them.
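To make the async hooks point concrete, here's a tiny, unscientific benchmark sketch; merely enabling an (empty) hook forces the runtime to track every promise, which is where the per-promise overhead comes from:

  const async_hooks = require('async_hooks');
  const { performance } = require('perf_hooks');

  async function burn() {
    for (let i = 0; i < 1e6; i++) await Promise.resolve(i);
  }

  (async () => {
    let t = performance.now();
    await burn();
    console.log('no hook:  ', (performance.now() - t).toFixed(1), 'ms');

    // an empty hook is enough to turn on promise lifecycle tracking
    async_hooks.createHook({ init() {}, promiseResolve() {} }).enable();
    t = performance.now();
    await burn();
    console.log('with hook:', (performance.now() - t).toFixed(1), 'ms');
  })();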


I’ve seen similar issues in graphql-ruby. Even if I hardcode the data in my resolvers, it takes hundreds to thousands of ms to render a list with some moderate nesting.


I've built fairly complex GQL backends using CPython + Graphene and never seen something like this; if we had slowness it was because we had yet to implement dataloader in some places.


We faced similar issues at Zalando when trying to use GraphQL at scale, and to mitigate this we built https://github.com/zalando-incubator/graphql-jit. Try it for your use case and let us know how it affects the performance.
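For anyone curious, basic usage (going by the graphql-jit README) is to compile a document once and re-execute the compiled form; schema, rootValue, context and variables below are assumed to already exist in scope:

  const { compileQuery, isCompiledQuery } = require('graphql-jit');
  const { parse } = require('graphql');

  const compiled = compileQuery(schema, parse('{ recipes { id name } }'));
  if (isCompiledQuery(compiled)) {
    // cache `compiled` and reuse it: execution skips re-interpreting
    // the AST on every request
    Promise.resolve(compiled.query(rootValue, context, variables))
      .then((result) => console.log(result));
  }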


Check out Super Graph: it's a GraphQL-to-SQL compiler and API service in Go. In production mode it uses prepared statements, so there's no compiling and hence very low latency. https://github.com/dosco/super-graph


Forgive me if this sounds a bit “hindsight 20/20”, but I feel like performance was always a lower consideration when it came to utilizing graphql. The win is in reducing overhead around providing new endpoints.

Like React, it eschews performance for the sake of enterprise-level scaling. This shouldn't come as a surprise to anyone, given that both of these came from one of the largest dev organizations in the world.


"Eschews performance" isn't the right way to put it. React allows you to do 90% of UI work in performant ways. It has good, predictable performance for the majority of work and lets you move through a lot of simple UI tasks quickly, then spend time focusing on performance in the parts of your app that matter. The situations where you really need performance tuning are going to be unique to your specific app and data.


> performance was always a lower consideration when it came to utilizing graphql

That's strange, because I thought the main selling point was to consume only the data you need. The client specifies exactly which fields it wants. Then it doesn't over-fetch. To make things higher performance.


GraphQL the protocol/language was designed for performance, but (when I tried GraphQL, which was several years ago) the server-side implementations seem to have had much less of a focus on it.

It's true that the client doesn't over-fetch (and also doesn't need multiple round-trips), but at least when I tried the gql-js library it required the server to over-fetch: it would ask for individual records, and then do the field plucking/record joining itself; there was no way to intercept the query along the way to find out which fields it needed so you could only fetch those.

I get the impression that the server libraries were designed to work with a document store or "fat" REST API that is only capable of taking a single ID and returning the entire record. In this situation it makes sense to have a separate middleware server to keep the big fetches and round-trips inside the datacenter and only give the client exactly what it needs, and needing a little more server power isn't a big deal. But, if you need to do something more sophisticated (even something as simple as only fetching certain fields from the datastore), they were no help whatsoever; when I was looking into it there wasn't even a way to parse the query into an AST and do the rest of the query planning yourself.
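(For what it's worth, current graphql-js does export its parser and an AST visitor, so today you can at least get the document and do your own planning; a sketch:)

  const { parse, visit } = require('graphql');

  const ast = parse('{ me { company { listings { projects { id name } } } } }');
  // walk the document and collect every requested field name
  const fields = [];
  visit(ast, { Field(node) { fields.push(node.name.value); } });
  console.log(fields); // [ 'me', 'company', 'listings', 'projects', 'id', 'name' ]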


Echoing this, GraphQL is just a specification, and it is up to library authors how that spec is implemented.

I think there might be a disconnect or misunderstanding in the developer about this. GraphQL is sort of like the Flux pattern for MVVM architecture. It isn't so much a thing as an idea.


The client doesn't over-fetch, but the server might need to over-fetch from the db with generic sub-optimal queries to resolve all the fields.

Your mileage may vary, but on the project at work where we tried to utilize GQL, it became apparent that it is sometimes incredibly complex to map between a GQL query and an efficient db query.

We're slowly migrating to REST for performance's sake. The front-end devs might complain since they can't just write a query to get exactly what they want, but I'm quite sure our customers won't complain about 50KB being served in 30ms instead of 10KB being served in 400ms.


from my experience, you might not need all the data in the first fetch but it is highly likely that it will be needed for subsequent renderings. What worked better for me is to make sure the client doesn't ask for the same data (ID) multiple times, client-side caching works wonders in that case


While it’s certainly true performance can be a trade off... 400ms+ response times are annoyingly slow. I’m not sure a trade off is worth it unless it’s some really exotic endpoint you’ve created


Without the actual code, this is as far as we can debug. Can the author create a small reproducible repo instead?


Any chance to reproduce and post the HAR file? Thanks.


Hmmmm... if anything the performance of a graphql query should generally outshine REST in nearly any category of performance. From the sound of things, the performance issue doesn't make any sense. He's using Dataloader, and he is certain it's not related to dataloader anyway. So maybe some dependency he's using is the wrong version.


Can you elaborate? I find it hard to believe that you can't build a REST API that's faster than GraphQL given all of the bells and whistles that GraphQL tacks on and that you could hand write the perfectly optimized REST endpoint. What am I missing?


REST can definitely be faster than GraphQL, because as you mentioned GraphQL is doing much more.

But as you start to re-use REST APIs across multiple pages/apps, you often end up:

- over-fetching: you'll receive some data that you don't actually need, but that is required by another page

- over-querying: the data you are over-fetching might have required extra queries (or, even worse, calls to an external system)

- cascading requests: if you are working with nested data you might have to call the server multiple times, often in a sequential manner

Also, in my experience the performance of REST APIs tends to get worse over time, because with many developers working on the same APIs it's almost impossible to keep them lean. Just before Christmas we had to spend some time figuring out why a fairly simple GraphQL request was taking almost 2 seconds. Turns out a developer had accidentally introduced an n+1 in the underlying REST API. The n+1 was on a field/relationship that we didn't need/use, but with REST you don't usually get to pick what you want to load so...

Solving these problems with REST is possible but not trivial, while GraphQL mostly solves them out of the box. So while REST could be faster than GraphQL, in my experience it usually ends up being slower.
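To make the "out of the box" part concrete: the usual GraphQL-side defence against n+1 is the DataLoader pattern, which batches the per-row lookups into one query. A sketch using the dataloader package (table and column names invented, pg-style db client assumed):

  const DataLoader = require('dataloader');

  // collects every author id requested during one tick of the event
  // loop and resolves them with a single query, instead of one query
  // per parent row (the n+1)
  const authorLoader = new DataLoader(async (ids) => {
    const { rows } = await db.query(
      'SELECT * FROM authors WHERE id = ANY($1)', [ids]
    );
    const byId = new Map(rows.map((r) => [r.id, r]));
    return ids.map((id) => byId.get(id)); // must match input key order
  });

  const resolvers = {
    Recipe: {
      author: (recipe) => authorLoader.load(recipe.author_id),
    },
  };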


I think this gets to the underlying criticism. GraphQL is not the replacement for REST, nor by nature an improvement; it is an alternative to REST based on a certain set of pain points. If you're a large org with multiple consumers and your API development is siloed, then GraphQL may solve a set of problems you have that a small homogeneous team may not.


Correct, but it's worth noting that GraphQL has much more to offer than just request performance. I'd argue that performance improvements are the less interesting part of GraphQL.

The main selling points for me are:

- speed of development: once you have a complete graph, you can add new pages/features in a fraction of the time, and often without touching the backend at all

- type safety: you can generate typescript/flow types for your queries, giving you type safety from db to client (assuming your backend has types)

- query co-location: you can have the query (or a fragment) inside or next to the component that uses it. Need a new field in a specific component? Just update the fragment, and any page that includes it will get it automatically (see the sketch below)

and last but not least, developer experience. Having worked for a few years using graphql + apollo + react + typescript (on both a personal project and a reasonably famous large website), I can honestly say that it feels like living in the future. As a mostly backend developer, I have never enjoyed working on the frontend so much.
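To illustrate the co-location point from the list above, a minimal sketch using graphql-tag with React (component and field names invented):

  import gql from 'graphql-tag';

  // the fragment lives next to the component that renders it
  export const RECIPE_CARD_FRAGMENT = gql`
    fragment RecipeCard on Recipe {
      id
      name
      # need a new field in this component? add it here, and every
      # query that spreads ...RecipeCard picks it up automatically
    }
  `;

  export function RecipeCard({ recipe }) {
    return <li>{recipe.name}</li>;
  }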


Well said. That's why my current team started using GraphQL. It actually had very little to do with performance or anything backend-related; it was mostly for the same reasons (FE speed of development, type checking, etc).


Would you mind sharing your email? You have a fantastic way of building a message, and we are looking for devs with that kind of skill.


If you don't want to share it publicly, you can find me very easily over the internet under the same name


Thanks for that follow up!


> Hmmmm... if anything the performance of a graphql query should generally outshine REST in nearly any category of performance.

I'd love an explanation for why you'd think this to be generally true, and for any category.



