
Many of us who have worked in the graph database space have been tempted to build graph abstractions on top of a relational database. It's a reasonable first step: tables are nodes, join tables are relationships, and so forth.

However, it's also a bit of a dead end once you go beyond the basics. The costs of joins get worse the deeper you go, and "hundreds of milliseconds" is at least an order of magnitude slower than what Neo4j would do for you.

Once you take that major performance penalty and then layer more complex graph algorithms or analytics on top of it, it gets really, really painful, quickly. Granted, you might not notice this if you never need to go further than 2-3 hops in a graph. But once you start working with graphs, you're not going to want to stick to such basics.

More technical detail on the difference between a graph abstraction on top of another database, and a native graph database, can be found here:

https://neo4j.com/blog/note-native-graph-databases/


Well, of course it would be expensive to use tables as arcs.

So say A points to B: you have 3 tables, right? Table 1 for A, table 2 for B, and join table 3 showing that A points to B. Why would you do that? What's stopping you from having one table that contains A and what A points to, so you only have 2 tables?

What if you have a node that can point to many items? A column can contain a list in Postgres, so we can still have one table containing your node data and a list of the items it points to.
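
To make that concrete, a minimal sketch of the idea in Postgres (the table and column names here are made up):

    -- Hypothetical single-table adjacency list: each node row stores
    -- an array of the ids of the nodes it points to.
    CREATE TABLE nodes (
        id        bigint PRIMARY KEY,
        payload   jsonb,
        points_to bigint[] DEFAULT '{}'
    );

    -- One hop: which nodes does node 42 point to?
    SELECT n2.*
    FROM nodes n1
    JOIN nodes n2 ON n2.id = ANY (n1.points_to)
    WHERE n1.id = 42;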

I'll concede that graph databases are easier to write queries for; most people already struggle with basic SQL, let alone CTEs and recursive CTEs.

I've yet to be convinced that a problem can't be reshaped and mapped onto a traditional RDBMS and yet remain performant.


> What's stopping you from having one table that contains A and what A points to, so you only have 2 tables?

So you can do that. Suppose you denormalize the graph into a single table. Either you're duplicating A's data in the table in order to connect A to many B's, or you have the constraint that the A -> B link can only have a cardinality of 1.

This would not be a good choice for nodes with many relationships to other things, say for example "Person friended Person". It might work if the cardinality was somewhat capped, like say "Customer ordered Product" (a common denormalization).
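
To illustrate with a hypothetical schema (not anyone's real one), the duplication looks like this:

    -- Single-table denormalization of "Customer ordered Product":
    -- the customer's columns repeat on every order row, which is
    -- tolerable only while the cardinality stays capped.
    CREATE TABLE customer_orders (
        customer_id   bigint      NOT NULL,
        customer_name text        NOT NULL,  -- duplicated per order
        product_id    bigint      NOT NULL,
        product_name  text        NOT NULL,
        ordered_at    timestamptz NOT NULL
    );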

For a real application, you're not going to have 1 of these. You're going to end up with 20+.

> I've yet to be convinced that a problem can't be reshaped and mapped onto a traditional RDBMS and yet remain performant.

Any problem can be re-shaped to any database formalism. For that matter we can re-shape everything we're discussing for a straight K/V store like Redis. The expressive power of a database isn't at issue, because any database can store any dataset. Period.

More relevant questions are about performance and about conceptual fit for the problem. First on performance, have a look at the performance growth graph here:

https://neo4j.com/blog/oracle-rdbms-neo4j-fully-sync-data/

If you reason from the computer science of how these systems work, this result makes sense.

From the conceptual simplicity standpoint, that's kind of a matter of personal taste and application. Can you do it all with an RDBMS? Sure. But all of these different tech niches exist because sometimes you want more than one kind of tool for the wide variety of jobs you need to accomplish.

I'd argue that it's conceptually simpler to think of your graph as nodes and relationships, rather than to remember each time which node/rel set was denormalized into one table, which node label was split out into its own separate table, what that join table was, how the key naming differed between tables, etc. etc. etc. (Because once you have a non-trivial sized graph, you'll have a lot of these, and maybe you made different decisions at different spots).

The prize if you do remember all of that is that you get to write quite complex SQL to join together the data structures for non-trivial traversals, because substantial graph use cases that you can answer on the basis of a single denormalized table are going to be rare.

In terms of conceptual simplicity, compare a recursive join SQL query to a Cypher snippet like "MATCH (user:User {login:'bob'})-[:KNOWS*..5]->(foaf)". The equivalent SQL is...difficult.
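
As a rough illustration, here's a minimal sketch of that query as a recursive CTE, assuming hypothetical users(id, login) and knows(from_id, to_id) tables:

    -- Everyone reachable from 'bob' in up to 5 KNOWS hops.
    WITH RECURSIVE foaf AS (
        SELECT k.to_id, 1 AS depth
        FROM users u
        JOIN knows k ON k.from_id = u.id
        WHERE u.login = 'bob'
      UNION
        SELECT k.to_id, f.depth + 1
        FROM foaf f
        JOIN knows k ON k.from_id = f.to_id
        WHERE f.depth < 5
    )
    SELECT DISTINCT to_id FROM foaf;

And that's the simple case: one relationship type, one table, no properties on the edges.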

It's just a "use the right tool for the job" situation at its core.


$100M is likely a placeholder amount. They're filing to go public and can adjust that figure later; it's not that strange to do so. I don't think $100M means much there, and relative to their overall valuation, if $100M were all they wanted they'd likely have other options.


Wanna thread other known attempts at this? Maybe someone will jump in with extra detail about how these approaches are different, or what extra value we'd expect.

- OpenCyc http://www.cyc.com/opencyc/
- DBPedia https://wiki.dbpedia.org/


Founder here. OpenCyc and Freebase are human attempts to enter and curate a structured knowledge base. Likewise, DBPedia is a set of scripts that extracts Wikipedia infoboxes (semi-structured data that is also human crowd-sourced).

The Diffbot Knowledge Graph is built by applying computer vision and natural language processing techniques to read all the pages on the web (which can be in any structure and human language) and extract them into a structured form, without any element of human annotation in the build pipeline.


Can you expand on the major points of how this will make the content different (for example, Wikipedia is curated and pages for non-notable people get thrown out, so if you're reading all of the web, presumably you'd know about non-notable people) -- and why it's better?


Founder here. There are many differences in the result when you have an automated system building a Knowledge Graph vs. a human one.

The obvious one is scale: Wikipedia has on the order of 10M entities and represents the work of thousands of humans, whereas the Diffbot KG has 10B entities, is discovering about 120M more each day, and is largely limited only by the number of machines running the algorithms in the datacenter. The properties and facts indexed about each entity are also a superset, because they are not limited to those that would be worthwhile for a human to curate. Lastly, it can be more accurate than facts found in a single source, because the automated system uses multiple occurrences of a fact across the web to estimate the probability that the fact is accurate.

The result is a Knowledge Graph that is more useful for work and business, because it contains the entities you interact with day to day, not just the "head" entities that optimize for popularity and the constraints of human curation.


Fantastic question. A major component of a machine-generated ontology has to be a notability score; otherwise it would be practically impossible to store all entities (and their relationships).

Further, for this to scale, Diffbot has to have a way to align their entity IDs with IDs from other notable graphs like Wikipedia, Wikidata, Freebase, WordNet, or even Yelp and the like; otherwise the data could potentially be of diminished value.

How would I know that the "Cardi B" that's in my database with ID 321 and wikidata ID Q29033668 is the same as Diffbot's "Cardi B" with ID 561?
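
(To sketch what I mean: if both sides exposed a shared external ID such as the Wikidata QID, alignment could be a simple join. The table names here, including the Diffbot side, are entirely hypothetical:)

    -- Hypothetical alignment on a shared Wikidata QID.
    CREATE TABLE my_entities      (id bigint, name text, wikidata_qid text);
    CREATE TABLE diffbot_entities (id bigint, name text, wikidata_qid text);

    SELECT m.id AS my_id, d.id AS diffbot_id
    FROM my_entities m
    JOIN diffbot_entities d USING (wikidata_qid)
    WHERE m.wikidata_qid = 'Q29033668';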


I've been the CTO at two tech startups. Here's how I think about it.

"Debt" is a good metaphor because by accumulating it, you get advantages right now in speed and ability to focus on other things, in exchange for having to pay it off later. You can either pay it off by stopping what you're doing (uncommon) and fixing stuff, or you can do lifetime installment payments of more complexity, bugs, etc.

Tech debt is just one aspect a CTO has to manage. Time to market is pretty damn important. Features are important. Many other things too. This is not to say that tech debt should be ignored, but elegance of the system is not the paramount concern. Just as startups borrow money and take financing to move quickly, so too a responsible amount of tech debt can be a good thing.

In engineering we tend to over-emphasize how important the technical qualities of the system are because that's what we see every day. There are a great many things that are also important, and various business scenarios where they become MORE important than tech debt. But if you're in the stage of trying to find product-market fit, you probably have bigger fish to fry.

The biggest tech debt I've struggled with always comes from the same patterned source. You think a system is intended to have a certain feature set for a certain user base, and so you make a set of architectural decisions to match. The world moves on quickly, you adapt, pivot, whatever you want to call it, and suddenly your carefully made architectural decisions are no longer correct or wise for your new scenario, but there is no time to go back and have a "do-over".

So in essence, failure to be able to tell the future, paired with a constant need to move forward, is the source of the problem. Both of those things are going to keep happening to startups forever.


In other words, it’s not a problem to be avoided, but rather a problem to be embraced.

If this is indeed true, how does one fully embrace it? Optimize for easy refactoring? What does this look like in practice?


No, I wouldn't put it that way. It's not a problem to be embraced; it's just another variable to manage.

Just as there is the trilemma of "better, faster, cheaper: pick 2" there are others as well. Speed and tech debt are balanced against one another. It's not that I think we should embrace tech debt. I actually rather hate it. But sometimes I'm willing to trade to get to a business objective that is not related to technical elegance.

In terms of how to pay it off, it may sound like a cop-out, but it just depends. The core problem is that things are always changing and we can't predict the future. So staying supple and flexible, I think, is the way to go. Pay off as much of it as you can, when you can, but the main point for me is to keep your eyes on some business objective (where tech debt appears as only one variable in the equation) -- the main point is not to keep your eyes on the tech debt all the time.


I’ll give my perspective (CTO at a small start-up), and I’ll start by saying that I fully agree with the GP.

Technical debt for us manifests as new cards in the backlog (and/or TODO comments in the code), because when we cut corners we usually have the presence of mind to flag it. I find that as long as these cards don’t rise to the top of the backlog, it’s safe to ignore them. It means that you can afford to pay the interest on that technical debt.

If however you find yourself repeatedly thinking “To do this feature I really need to fix this issue first”, well then it might be time to repay some of the principal, so to speak.

In practice, the first time we build a new type of feature, we’ll do it quickly and dirtily and accept that it may be throwaway code. If we end up doing the same thing a second or third time, then we’ll refactor. But then we’ll know there’s a need to do it cleanly, and we’ll have two or three examples from which to try to generalize what the architecture should be. On the other hand, we really try to avoid over-engineering and premature optimization. If it’s good enough for Donald Knuth...


It is a tool to leverage when appropriate.

You can write tests when prototyping or you can borrow a little tech debt and skip the tests during the prototyping phase. At this stage it is like a line of credit.

Then, as the feature/product starts to take shape, you can pay down the debt by writing the tests and refactoring the code to be simpler. This would be paying off the line of credit.

Or, you can roll the line of credit into a term loan and move on. This will increase your long term debt.


I'm not in a startup, but I always try to budget ($/time) for 20% overhead to undo technical debt. When/where that isn't possible, it just gets harder and harder to fix.

These issues are "perfect is the enemy of the good" scenarios. If you don't have any "technical debt" (something you don't like, or something you only got 80% right), that is a far bigger problem. How to manage it really depends on your scenario.


This is where most of the 'debt' I manage comes from. We have a very large ecosystem of interconnected systems, many of which have roots 15 years in the making. While the design decisions made back then were valid, the expectations of our clients and industry have of course evolved. We're a small team and it's impossible to stop and rebuild the system, so I try to manage it now by building services we can plug into the various systems. It helps reduce the scope of future debt and often allows updates without the stop-and-rewrite nightmare. It doesn't work for every case, but it has helped, and hopefully whoever I pass this on to will find it easier to manage.


This sounds very familiar. The software development company I work for uses at least 10 different platforms internally across Linux, Windows & Unix... SQL Server, MySQL & Oracle DBs. 10-year-old PHP customer support systems, 8-year-old JBoss & Tomcat servers with deserialization vulnerabilities at every turn, .NET v3 testing tools, mixed up with a sprinkling of modern platforms, all talking to each other through REST, direct DB connections, microservices & a hearty dose of black magic. Management ignores vulnerability reports, due mostly to non-technical backgrounds. Most documentation is at least 3 years old, from when 80% of the dev team was sacked & outsourced to India. Mostly keeping the lights on now during an acquisition... can't wait to debrief the new owners. Have acquired some great experience in investigating technical debt in the process, which will hopefully be useful in the future.


Technical debt is less analogous to financial debt, and more like an anchor. You want to move fast to keep up with the market, but your debt slows you down.


That sounds exactly like financial debt to me. Too much (financial or technical debt) is an anchor, healthy amounts of debt result in not being able to do everything / having some limitations but overall you're in a good place, and no debt probably means it's time to actively pursue opportunities. No?


With technical debt, each new feature now costs n times as much, where n is the multiplier for the accrued technical debt. A feature that you need in 2 months to support a market shift now needs 4 months to complete due to technical debt.

The issue with this oversimplified formula is that you can't accurately determine which debt affects which features. For some features the multiplier could be zero, and for others it could be 100.

However, I do agree that all teams should be carrying an amount of technical debt to be healthy. It shows a certain quality of decision making to balance it well.

All too often though it becomes an excuse to procrastinate and that might as well be gambling.


I see this as interest on debt. When you have $1K of credit card debt, a $500 payment goes a long way. When you have $20,000, that same $500 is mostly covering interest, so you pay down the principal slower.

The additional complexity and work that comes with tech debt is the interest you pay on it.


Not really; I have seen whole features dropped. Like payment processing: the business had an idea to use Stripe, and 6 months later all the code was deleted because our business model was not for people who would pay with Stripe.

We also pivoted another application, where only the database stayed roughly the same. Loads of legacy stuff was still there, but it is going to be phased out soon.

So I have seen situations where tech debt was never paid back. I have also seen one guy who is not paying me back my money, but I still remember he owes me. Some tech debt will go into oblivion in the next year or two...


That anchor effect of being slow is what I meant by "installment payments". Financially, if you have enough debt you can make $300k a year and still live in a crappy apartment, because all of your income (in this weird metaphor, that's your available dev hours) is being spent on debt (fixing things that wouldn't have occurred had you not had the debt).


I will concede some points in your favor for that. I will counter with the fact that almost no CTO will change their deadlines accordingly, which leads to team burnout/turnover.


Why so much faith in non profits? If they have some special sauce, I'm not sure what it is. They still require the prisoners in order to exist and grow. And economically there's this old joke... they're not for profit, but they're not for loss either.


> Why so much faith in non profits?

Perverse incentives. If a company's profit is linked to higher incarceration, they are incentivized to make the problem worse, not better.

This doesn't make all non-profits magical paragons of virtue, or all for-profits evil. It is simply an indicator.


I've heard this one before when discussing a non profit that was thinking of shopping itself around for an acquisition: "I thought they were not-for-profit?" ... "Well, yeah, but they're also not-for-going-out-of-business."


> Why so much faith in non profits?

I don't have that much faith in non-profits. But I don't have any faith in a for-profit company to do social good.


Some differences from my perspective:

- SOA/ESB in the 2000s was driven more by business and less by available tech. As an example of what I mean here, SOA folks would usually talk about decomposing business processes. E.g. Amazon sells something, and there's a "shopping cart" service, a "payment" service, and a "delivery" service. (I'm simplifying, but you get the point.) Modern microservices are way more granular than that, and it could be that you have 5 tech-oriented services that compose to a single sub-step of a business process. The original SOA idea had the concept right, but was (IMHO) still too coarse to make the idea work.

- Several enabling techs that came since then (containerization being an example key one) didn't exist then, which made doing the same thing 5x more painful. So the tech just grew up.

- On the data side, SOA/ESB was driven by XML and XML Schema. As someone who practiced a lot of that, I found it really painful to use a document markup language to do structured data exchange. XML was popular because, in the age of proprietary formats, it was the first open, text-based, non-license-encumbered format. So don't get me wrong, there wasn't anything better at the time, but that didn't make XML actually good for the job. Note this is not a comment that JSON is better. In 2018 we're in a world where open data standards are the default. So it's not XML vs. JSON; it's XML vs. the entire rest of the world, and you have so many options.

- Performance improved. How many years of Moore's law and of storage & memory gains have passed in between? This may not seem like a big deal, but it is. Since 2000 we've gained so much computing power that we can afford to slather on another 10 layers of abstraction to make microservices easier on ourselves.

- Software support improved. Go back to 2001 and scaffold a Java app that worked with WSDL and SOAP. Then go check out 2018's serverless framework and scaffold a node.js serverless function. Back in 2001 you may have been manually downloading JARs and putting them in a lib folder, then checking that into CVS. In 2018 it's yarn install whatever, saving the dependency structure (but not binaries) in git. This amounts to hours of extra work the developer is no longer doing. Greg LeMond, the famous cyclist, was inadvertently talking about software when he said: "It never gets easier, you just go faster".

- Infrastructure improved. As with the previous point, there are hours of work you're not doing, which lets you focus on your microservice. For example, almost no one who writes a microservice admins the server it runs on. Why would you take such care to admin a single box? Just execute it and spin up another. The average developer of 2001 would be mind-blown.

...But some things stay the same... namely, there's nothing free in this world, and everything's a tradeoff.

- Service decomposition to the right level is still really tricky and people regularly screw it up.

- Being able to write your system in 10 different languages is nice for flexibility of hiring, speed, and team independence. But then of course you have to maintain 10 different languages worth of software.

- We've traded debugging of simple stack traces in monoliths for ultra-complex network inspection setups where we debug the failure to pass a parameter to a remote function through a nightmare of complexity -- through the tech stacks of both services, through the networking layer, through the containerization layer, through the orchestration layer, etc. etc. Monoliths were not without their charms.


I see what you mean. And yet strangely, the need for that "Friendship" node can also be seen as a strength. How would you assert metadata about that thing otherwise?

If A and B share an event via a friendship, you can keep track of things about that. Granted, in an RDBMS, if all you wanted was to draw the line, then you could do it with an extra FK; but I think the conclusion you're drawing goes too far, specifically:

> In SQL that wouldn't have been a remodel, because there's no difference between a relationship and an entity

There is a difference in SQL; relationships are EITHER more columns OR a join to another table, and both are possible. In graphs, "hyper-relationships" (e.g. relating more than 2 things) require another node, but this is an apples/oranges comparison.
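
To make the join-table form concrete, here's a hypothetical sketch; the point is that the relationship row can carry its own metadata, which is the SQL analogue of asserting facts on a Friendship node:

    CREATE TABLE person (
        id   bigint PRIMARY KEY,
        name text NOT NULL
    );

    -- The n:m relationship as a join table. "since" is metadata
    -- about the relationship itself, not about either person.
    CREATE TABLE friendship (
        person_a bigint NOT NULL REFERENCES person(id),
        person_b bigint NOT NULL REFERENCES person(id),
        since    date,
        PRIMARY KEY (person_a, person_b)
    );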


I probably should've clarified that I was talking about n:m-relationships. And for that case I don't see how it would've been an apples/oranges comparison.


While this is technically true, in the SQL world this requires wizard-level skills that most SQL developers do not possess, and when you arrive at this spot, you end up with a query that performs really, really badly.

Look, between the database formalisms, they're all "complete" in the sense that you can choose any database and solve all the problems. But certain databases are going to be pathologically bad at solving certain types of problems, which is why there are so many sub-niches that persist over time.

For deep path traversals, you can do it with RDBMS, but a graph DB is going to win every time in part because the data structure is just set up for that purpose. There are other queries where RDBMS will be best too. So it goes.


Graph databases are built a lot of different ways; for example, Neo4j's architecture is very, very different from something like an RDF triple store, or DataStax on top of Cassandra.

[Neo4j internals can be seen here](https://www.slideshare.net/thobe/an-overview-of-neo4j-intern...)...it's a bit old but I think mostly still accurate.

In graphs you have to persist nodes and edges, though you may partition nodes by label/category. In the case of Neo4j there is a property store rather than a set of columns.


Thanks, very helpful. I am just looking at it and will have a bit of a think about this later :)


The media has reported on some fairly outrageous cultural excesses. Question -- did you experience these when Kalanick was there? Since Dara showed up, anything specific you can say about what was done and whether it "feels different" and how?

In some places the CEO is a distant figurehead, and whatever his or her values, the day-to-day experience is more dominated by your local crew of < 20 people. In other places, culture is truly pervasive. I feel like the reporting is saying that culture is pervasive at Uber, but it would be interesting to hear an inside perspective.


Both? I experienced nothing anywhere close to the egregiousness mentioned in the media. But TK's ruthlessness and long-work-hours ethic were always omnipresent. It wasn't so much that working hard was mandated top-down, but the culture was there. Paradoxically, managers at times would have to push employees to take breaks, vacations, etc. I personally relished that environment, but it's understandable how those with other commitments would find it hard. Our compensation structure also encouraged working hard (and all the stress, burnout, and disappointment that stemmed from it), so even if an employee had all the freedom to leave work at 5 pm and do other things, unless you were ultra-efficient in the 8-9 hours at work (which some of the best engineers I worked with were), there was always a risk of missed incentives. In short, it felt like a great company for those who could manage these trade-offs. For all others, it was stressful.


What do you mean by cultural excesses?

