Without knowing more about the specific needs (I followed some of the threads to try to grok them), it is hard to guess what they really need.
[commercial alert]
My company (Scalable Informatics) literally builds very high performance Ceph (and other) appliances, specifically for people with huge data flow/load/performance needs.
We've even got some very nice SSD and NVM units, the latter starting around $1 USD/GB.
[end commercial alert]
I noticed the 10k RPM drives ... really, drop them and go with SSDs if possible. You won't regret it.
Someone suggested an underprovisioned 850 EVO. We strongly recommend against this, based upon our experience with Ceph, distributed storage, and consumer SSDs. You will be sorry if you go that route, as you will lose journals, MDS data, or whatever else you put on there.
Additionally, I saw a thread about using RAIDs underneath. Just ... don't. Ceph doesn't like this ... or rather, won't be able to make effective use of it. Use the raw devices.
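To make "use the raw devices" concrete, here is a hedged sketch using the ceph-volume tool (device names are placeholders, and this is generic usage, not our appliance configuration; older releases used ceph-disk instead):

```shell
# Illustrative only: give Ceph the whole block device, with no RAID underneath.
# /dev/sdb, /dev/sdc, and /dev/nvme0n1p1 are placeholder device names.

# BlueStore OSD directly on a raw device:
ceph-volume lvm create --bluestore --data /dev/sdb

# FileStore OSD with its journal on a separate, faster device:
ceph-volume lvm create --filestore --data /dev/sdc --journal /dev/nvme0n1p1
```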
Depending upon the IOP needs (multi-tenant/massive client systems usually devolve into a DDoS against your storage platforms anyway), we'd probably recommend a number of specific SSD variations at various levels.
The systems we build are generally for people doing large scale genomics and financial processing (think thousands of cores hitting storage over 40-100Gb networks, where latency matters, and sustained performance needs to always be high). We do this with disk, flash, and NVMe.
I am at landman _at_ the company name above with no spaces, and a dot-com at the end.
Thanks for the shout out! Without being a commercial, this high performance storage/analytics realm is what we focus on. We have been demoing units like this: https://scalability.org/images/30GBps.png for years now. Insanely fast, tastes great, less filling. Our NVM versions are pretty awesome as well.
[edit]
I should point out that we build systems that the OP wants ... they likely don't know about us as we are a small company ...
I am setting a unit up now for a customer. In the years I've been working on them, the toolchain you need has not gotten much better ... no ... it went from complete crap to meh.
You really have no choice but to use the Intel compilers for them; gcc won't optimize well for this hardware at all. This means that in addition to the higher entry price for the hardware, you have a sometimes painfully incompatible compiler toolchain to buy as well, and then code for. You have to adapt your code to the vagaries of this compiler relative to gcc, and these adaptations are sometimes non-trivial.
I am not saying gcc is the be-all/end-all, but it is a useful starting point. One that the competitive technology works very well with.
From the system administration side, honestly, the way they've built it is a royal pain. Trying to turn this into something workable for my customer has taken many hours. This customer saw many cores as the shiny they wanted. And now my job is (after trying to convince them that there were other easier ways of accomplishing their real goals) to make this work in an acceptable manner.
The toolchain is different from that of the substrate system. You can't simply copy binaries over (and I know they have not yet internalized this). The debugging and running processes are different. The access to storage is different. The connection to networks is different.
What I am getting at is that there are too many deltas from what a user would normally expect.
It is not a bad technology. It is just not a product.
That is, it isn't finished. It's not unlike an Ikea furniture SKU. Lots of assembly required. And you may be surprised by the effort to get something meaningful out of it.
As someone else mentioned elsewhere in the responses, the price (and all the other hidden costs) is quite high relative to the competition ... and the competition's toolchain stack is far simpler and more complete.
The hardware isn't a failure. The software toolchain is IMO.
Mine was in theoretical physics. I liked my advisor and the work. But she wasn't a go-getter on grant writing, or on paper writing.
Many of my friends left and went to medical physics. I thought of it, but decided against it.
This was 20+ years ago.
In hindsight, it was a mistake (for me) not to do this.
First off: There is no shortage of PhDs to fill tenure track ranks, and there are precious few tenure track ranks available. Unless you are at a top school, with a really special advisor, and your work is getting attention, you should assume that academe is not in your future.
Yes, I know this is harsh. I wish someone had said these words to me in 1992-1994.
Second: The magic of a PhD is that it is a union card for some jobs, and a good way to open other doors. But know that the time you spend doing it is time spent away from economically productive life.
Has my now 20+ year old PhD been a boon to my career? I don't know how to answer this. It conveys (via an ill-advised appeal to presumed authority) some intelligence. And maybe it gets me better bargaining ability. Or maybe not. Maybe it typecasts me. I don't have enough data to know.
What I do know is, given the benefit of hindsight, I would have done very different things.
Third: You always have options, and you always have choices. You need to ask, and answer, a very hard question for yourself. Specifically, what do you want to do with your life? And even if you don't have a hard answer, which is fine, a roughly general sense of what you want to do is a good thing.
I decided after seeing the FSU collapse, and the market flooded with fairly senior FSU physicists, that I would focus on learning to be an entrepreneur and business guy. I like working with, and building, supercomputers, and it turned out there was at least some demand there at the time.
You need to ask yourself these sorts of questions, and you know, it's OK to answer "I don't know". But take the time to figure out what makes you happy. Because YOLO, and it's better not to waste time doing something that won't make you happy and won't make you a living.
Quick plug for a company I've used: SSLmate (http://sslmate.com) makes cert purchase (for those who need this) painless and fast. They use Comodo and Geotrust FWIW. I've had my own pain with Comodo through other resellers, and moved on to sslmate and godaddy. Recently moved my home blog (http://scalability.org) to LE. Work (http://scalableinformatics.com) is using godaddy for now, though I'm thinking hard about using sslmate for it going forward (because ... godaddy).
A key component of the network side of our (https://scalableinformatics.com) SIOS layer is handled by Mojolicious running on Perl. Has been for a while (about 5 years).
I've taken some steps to do a re-implementation in node, but the perl version "just works" with very little fuss, under fairly heavy load, and is pretty easy to debug when I need to.
This is very much a microservice: tailored PXE environments as a service based upon a database, and the booting mac address as a key. A programmatic/database-based backend to a PXE server. It allows us to boot effectively anything that is bootable by modern hardware, from Linux, Windows, SmartOS/OpenSolaris, through FreeDOS, and other more esoteric systems. A number of our customers use this as a configuration tech underneath their own orchestration layers.
All Perl based, and using techniques as modern as possible.
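The core idea above is simple enough to sketch. Our real implementation is Mojolicious/Perl; the following is a hypothetical Python illustration (all names, MACs, and profiles here are invented): the booting machine's MAC address is the database key, and the service renders a tailored PXE config for that machine, falling back to a default profile for unknown hardware.

```python
# Hypothetical sketch of PXE-config-as-a-service keyed on MAC address.
# Stand-in for the real database: MAC -> boot profile (all values invented).
BOOT_DB = {
    "52:54:00:12:34:56": {"os": "linux", "kernel": "vmlinuz", "initrd": "initrd.img"},
    "52:54:00:ab:cd:ef": {"os": "smartos", "kernel": "platform/i86pc/kernel/amd64/unix"},
}

# Unknown machines get a generic rescue environment.
DEFAULT_PROFILE = {"os": "linux", "kernel": "rescue-vmlinuz", "initrd": "rescue-initrd.img"}

def render_pxe_config(mac: str) -> str:
    """Render a pxelinux-style config for the given booting MAC address."""
    profile = BOOT_DB.get(mac.lower(), DEFAULT_PROFILE)
    lines = ["default boot", "label boot", f"  kernel {profile['kernel']}"]
    if "initrd" in profile:
        lines.append(f"  append initrd={profile['initrd']}")
    return "\n".join(lines)
```

In the real service this lookup sits behind an HTTP endpoint that the PXE/TFTP layer queries at boot time, which is what makes it usable underneath a customer's own orchestration layer.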
We stopped using system Perl many years ago. Red Hat seems to like to ship not merely obsolete versions, but versions actually past their end-of-life, so that they are not really supported upstream anymore. They have a similar issue with Python and other languages. So, reluctantly, about a decade ago, we started building our own toolchain. First with Perl, then adding in Python, Julia, R, Octave, and other analytics codes. Our analytics tools use all of these as part of our SIOS rambooted appliances (http://scalableinformatics.com/fastpath), so we needed the updated toolchain. We are looking at Rust as well for future work, and have looked at incorporating Go, but we don't have any Go code developed/planned as of yet.
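As a hedged illustration of one way to bootstrap a Perl independent of the vendor's (this is generic perlbrew usage, not our actual build system; the version number is only an example):

```shell
# Generic sketch: build and use a modern Perl without touching the system perl.
# The version below is illustrative only.
curl -L https://install.perlbrew.pl | bash
perlbrew install perl-5.24.1
perlbrew switch perl-5.24.1
perlbrew install-cpanm
cpanm Mojolicious   # project dependencies install against this perl, not the vendor one
```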
While I acknowledge the humor intended, it is important to note that nipping incorrect conjectures before people waste time building edifices atop them is an extremely good outcome. In the strongest possible manner, this advances mathematics by labeling incorrect pathways as incorrect.
You only need one counterexample to a theory or conjecture to prove it wrong.
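A classic concrete instance: Fermat conjectured that every number of the form 2^(2^n) + 1 is prime; Euler refuted the whole conjecture with a single counterexample, factoring F5.

```python
# One counterexample suffices: Euler showed F5 = 2**32 + 1 is composite,
# refuting Fermat's conjecture that all Fermat numbers are prime.
f5 = 2**32 + 1
assert f5 == 4294967297
assert f5 % 641 == 0          # Euler's counterexample: 641 divides F5
assert 641 * 6700417 == f5    # the full factorization
print("F5 is composite, so Fermat's conjecture is false.")
```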
... well, then there is the issue of distributing that 2TB data set. I'll get to the Amdahl's law issue in a moment.
This is a non-trivial problem. OK, it is trivial, but it's serial in most cases. Unless you start out with a completely distributed data set, and allocate permanent space on those 1000 nodes, so the data has to move only once and you can amortize that cost across all the runs.
In reality, you can't. We have customers using PB of data for their analytics. Even spread across 1000 nodes, that's still TB scale per node.
Our approach is not radical; it's simple. Build a better-architected system, with much higher bandwidth and lower latency interconnects, so data motion can happen at 10-20 GB/s per machine. Then you can walk through your data in 50-100 seconds per TB per machine (our customers do). And if you need to scale up/out, use 100Gb networks, and other things.
On Amdahl's law: In its simplest form, the law states that your performance is bound by the serial portion of the computation. Even if you drive the parallel portion to zero time, you are still stuck with the serial portion bounding your performance. So let's take your example.
1000 nodes, 2TB of data, assume a standard crappy cloud network connection: a 1GbE link per node. The serial portion of this computation is the data distribution. At 1GbE, you can move 2GB in about 20 seconds (hurray!). But you've got 1000 nodes, so it's 20 seconds x 1000, or about 1/4 day. Remember, the data starts out in one bolus, unless you allocate those machines and their storage permanently, and that type of allocation would be cost prohibitive.
OK, use 10GbE. You'll actually get about 2-4 Gb effective speeds, but fine: maybe 5-10k seconds to move your data. And your compute run is deeply in the noise, at 4 seconds.
Still not good.
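The arithmetic is easy to sanity-check. A minimal sketch, assuming ~125 MB/s usable on 1GbE (which gives ~16 s for 2GB, in line with the "about 20 seconds" ballpark above):

```python
# Back-of-envelope check of the serial data-distribution bottleneck.
nodes = 1000
per_node_gb = 2          # 2 TB spread evenly over 1000 nodes
gbe_rate = 0.125         # GB/s, roughly the wire speed of 1GbE

per_node_seconds = per_node_gb / gbe_rate    # ~16 s per node
total_seconds = per_node_seconds * nodes     # serial: nodes are fed one after another
print(f"~{total_seconds / 3600:.1f} hours")  # ~4.4 hours, roughly 1/4 day
```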
For less than the cost of doing this with capable/fast machines in the cloud, where you have to keep moving your data back and forth, you could get a simple, bloody fast machine that can handle the data read in 50-100 seconds.
Our thesis (ok, tooting our horn now) is that systems architecture matters for high performance large data analytics. Seymour Cray's statement about 2 strong oxen vs 1024 chickens is apt around now.
Cheap and deep are great for non-intensive data motion and analytics. Not so much for very data intensive analytics.
Again, I am biased, as this is what my company does.
> 1000 nodes, 2TB of data, assume standard crappy cloud network connection, use a 1GbE connection per node. The serial portion of this computation is the data distribution.
I find your statements confusing.
The whole point of things like hadoop is that the data is already distributed and the data storage nodes are also computational nodes. So there is no data distribution that takes 1/4 day or even 50-100 seconds. It takes 0 seconds because you just run the computation where the data already is.
My experience with HPC systems is more "Jobs are paused because the shared filesystem is unavailable"
Heh (on the jobs-paused bit) ... most modern shared file systems are HA (or nearly HA), except when people build them cheaply. And then you get an effective RAID0.
I was pointing out that if you are doing the analytics at AWS or a similar on-demand scenario (a common pattern I see people trying/using and eventually rejecting), you have a serial data motion step to distribute data to your data lake before processing. Then you extract your results and decommission all those servers. Rinse, and repeat.
The point is that for ephemeral compute/storage scenarios, you have a set of poorly architected resources tied together in a way that pretty much guarantees a large (dominant) serial step before anything goes in parallel.
What we advocate (and have been doing so for more than a decade), are far more capable building blocks of storage+compute. So if you are going to build a system to process a large amount of data, instead of buying 1000 nodes and managing them, buy 20-50 far more capable nodes at a small fraction of the price, and get the same/better performance.
It also doesn't take 0 seconds. There is distribution/management overhead (very much non-zero), as well as data motion overhead for results (depending upon the nature of the query). When you subdivide the problem finely enough, the query management overhead actually dominates the computation (which was another aspect of my point, though I wasn't explicit about it).
So we are fine with building hadoop/spark/kdb+ systems ... we just want them built out of units that can move 10-20 GB/s between storage and processing. That lets you hit 50-100 seconds per TB of processed data, which gives you a fighting chance at dealing with PB-scale problems (which a number of our users have) in reasonable time frames.
Sounds like you are doing the opposite of what joyent are doing: they basically pair (relevant parts of) data with the program (to process that part). And the reduce/aggregate over the result (that's my takeaway from joyent's marketing, anyway).
Which company do you work for? (unless it's a secret for some reason)
Actually Joyent's manta is quite similar in concept to what we've been doing for years (before manta came out). The idea is to build very capable systems and aggregate them. Not a bunch of fairly low end units (like typical AWS/etc). Our argument is that if you are going to build a high performance computing infrastructure, you ought to build it in an architecturally useful manner. The cost to do so is marginally more than "cheap n deep", while the savings (fewer systems needed for very large analytics) is substantial.
Thank you for clarifying. Re-reading your first comment, in light of your second, I see that that's indeed what you were saying in the first place. But apparently that's not quite what I read :-)
TL;DR version: Anything that is not your application won't use the resources in the same way, and will have very different properties (scaling, performance, contention, etc.)
Longer:
bonnie++ is not a load generator. Really. Start a typical run with a command line test case you find online and you'll see no actual IO, just lots of cache hits. We used to use it, 10 years ago, to play around with load generation, but found that it didn't generate enough IO, or generate it in a way that actually matched what people do.
IOzone is marginal ... I wouldn't use it for a serious test, and when people suggest dd or IOzone, I ask them how well that code actually matches their use case. Chances are it is largely irrelevant. Worse still, the IOzone throughput measurements are basically bogus: a naive sum of bandwidths, rather than the interesting data (the actual histogram or distribution of performance, including the start/end times and the per-thread/process rates as a function of time).
fio is good, in that you can implement many types of tests that have a reasonable chance of being meaningful. Using fio, I caught SandForce controllers compressing non-random benchmark data on the SSDs that used them; actual SandForce performance was lower than spinning rust once you fed it truly random data.
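As an illustration of the kind of fio run that avoids both the cache-hit trap and the compressing-controller trap (this is a generic sketch, not one of our actual test cases; paths and sizes are placeholders and should be tuned to exceed RAM and match your access pattern):

```shell
# Generic fio sketch: O_DIRECT random reads with refilled (incompressible-ish)
# buffers, so neither the page cache nor a compressing controller can flatter
# the numbers. Filename and size are placeholders.
fio --name=randread-test \
    --filename=/mnt/test/fio.dat \
    --rw=randread --bs=4k --size=64g \
    --ioengine=libaio --direct=1 --iodepth=32 \
    --numjobs=8 --runtime=300 --time_based \
    --refill_buffers --group_reporting
```

Even then, the caveat above stands: this only approximates your workload, and the real test is your own application.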
We wrote something called io-bm (https://gitlab.scalableinformatics.com/joe/io-bm) for pounding on parallel file systems (specifically, to stress them and see how they scaled and dealt with contention for network, metadata, etc.). I am not sure if the repo has our histogramming and time series bits, which let us see individual thread performance as well as overall performance; they might be in a private repo.
Basically it boils down to the TL;DR above. If it's not your code, then you are likely testing with an application that is somewhere between partially and wholly irrelevant to your use case.
[commercial alert]
My company (Scalable Informatics) literally builds very high performance Ceph (and other) appliances, specifically for people with huge data flow/load/performance needs.
Relevant links via shortener:
Main site: http://scalableinformatics.com (everything below is at that site under the FastPath->Unison tab) http://bit.ly/1vp3hGd
Ceph appliance: http://bit.ly/1qiOYpy
Especially relevant given the numbers I saw on the benchmarking ...
Ceph appliance benchmark whitepaper: http://bit.ly/2fMahfJ
Our EC test was about 2x better than the Dell unit (and the Supermicro unit), and our Librados tests were even more significantly ahead.
Petabyte scale appliances: http://bit.ly/2fuTTAH
We've even got some very nice SSD and NVM units, the latter starting around $1 USD/GB.
[end commercial alert]
I noticed the 10k RPM drives ... really, drop them and go with SSDs if possible. You won't regret it.
Someone suggested an underprovisioned 850 EVO. We strongly recommend against this, based upon our experience with Ceph, distributed storage, and consumer SSDs. You will be sorry if you go that route, as you will lose journals, MDS data, or whatever else you put on there.
Additionally, I saw a thread about using RAIDs underneath. Just ... don't. Ceph doesn't like this ... or rather, won't be able to make effective use of it. Use the raw devices.
Depending upon the IOP needs (multi-tenant/massive client systems usually devolve into a DDoS against your storage platforms anyway), we'd probably recommend a number of specific SSD variations at various levels.
The systems we build are generally for people doing large scale genomics and financial processing (think thousands of cores hitting storage over 40-100Gb networks, where latency matters, and sustained performance needs to always be high). We do this with disk, flash, and NVMe.
I am at landman _at_ the company name above with no spaces, and a dot-com at the end.