> Data never changes, but we have the possibility to create a new version of the data.
Well, it depends on what you mean by data. To avoid ambiguity it is better to talk about data values and data objects which have different properties. This can be formalized as follows [1]:
o data values are modelled via mathematical tuples – tuples are immutable
o data objects are modelled via mathematical functions (a field is a function from the object reference to the field value) – functions are supposed to be mutable
(In reality, of course, we meet quite different situations; for example, structs can be mutable and objects can be immutable.)
Here is one possible implementation of the concept-oriented model of data for data processing. It relies heavily on functions and operations with functions, and is an alternative to purely set-oriented approaches like map-reduce or SQL-style join/group-by:
Functions are a mapping between a domain and a codomain. The mapping itself is not mutable: the definition of the function is that relationship between the domain and the codomain.
If I have a function:
int Add1(int x) => x + 1
I would expect the domain and codomain to be immutable; I would also expect x + 1 not to turn into x / 2 at random.
Assume f: X -> Y. We can now map x_1 to y_1 f(x_1)=y_1. And then change this same function by mapping x_1 to y_2: f(x_1)=y_2. Thus we can easily modify functions. Moreover, we do it constantly when we modify object fields in OOP. It is probably easier to comprehend if a function is represented as a table which we modify.
In contrast, we cannot modify data values (mathematical tuples). Say, x=42+1 means that a new value 43 is created rather than the existing value 42 is modified.
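To make the "function as a table" picture concrete, here is a minimal Python sketch (the dict stands in for the function's table of mappings; the names and values are illustrative):

```python
# A function f: X -> Y represented as a table (dict) of its mappings.
f = {1: 10, 2: 20}   # f(1) = 10, f(2) = 20

# "Modifying the function": remap x = 1 to a new output.
f[1] = 99            # same table, updated mapping: now f(1) = 99

# Values themselves stay immutable: 42 + 1 creates a new value,
# it does not change 42.
x = 42
x = x + 1            # x is rebound to the new value 43; 42 is untouched
```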
> I would expect the domain and codomain to be immutable;
No. Domains, codomains and any set can well be modified by adding or removing tuples. What is immutable are values (in the sets).
> Assume f: X -> Y. We can now map x_1 to y_1 f(x_1)=y_1. And then change this same function by mapping x_1 to y_2: f(x_1)=y_2
They would be different functions, the first being the identity function: x => x, the second being: x => x + 1
> Thus we can easily modify functions. Moreover, we do it constantly when we modify object fields in OOP
This isn't the case. A field with a different value in it just means the object is a different value. If the object is passed to a static function, then the domain is the full set of possible values that the object can hold (this is known as a product type: you multiply the total possible values of each of its component parts to find the size of the domain).
If it's passed to a method then there's an additional implicit argument: `this`, which is the same as a static function with an additional argument that takes the object. The function is the same.
Global (or even free variables) should also be considered part of the domain: i.e. it's akin to implicit arguments that are being passed to the function.
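A minimal sketch of that product-type counting argument in Python (the `Obj` type and its fields are hypothetical, chosen just to keep the sizes small):

```python
from dataclasses import dataclass
from itertools import product

# A hypothetical object with two fields: a bool and a small int.
@dataclass(frozen=True)
class Obj:
    flag: bool      # 2 possible values
    mode: int       # say, values 0..3 -> 4 possible values

# The domain of a function taking Obj is the product of the
# fields' value sets: 2 * 4 = 8 possible inputs.
domain = [Obj(f, m) for f, m in product([False, True], range(4))]
print(len(domain))  # 8
```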
> No. Domains, codomains and any set can well be modified by adding or removing tuples.
This also isn't the case. If a function is defined that takes an integer and returns a boolean value, Int → Bool, then the domain is the set of integers and the codomain is {True, False}. You can't pass a tuple to a function that takes an Int and thereby dynamically increase the size of the domain. Even in dynamic languages the codomain is effectively `top`, the type that holds all values; the domain is then all values and the codomain is all values, which makes them immutable still.
Now maybe I am misunderstanding you, but this is how all of the mainstream statically and dynamically typed languages work. Perhaps there's some edge-case language that I'm missing here that allows types to be extended, which would be interesting in its own right.
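For illustration, a minimal Python sketch of a fixed domain (Python does not enforce annotations at runtime, so the check is written out by hand; the function name is made up):

```python
def is_even(n: int) -> bool:
    # Enforce the declared domain explicitly: anything that is not
    # an int is simply outside the function's domain.
    if not isinstance(n, int):
        raise TypeError("domain of is_even is Int, not " + type(n).__name__)
    return n % 2 == 0

print(is_even(4))        # True
try:
    is_even((1, 2))      # a tuple is outside the domain
except TypeError as e:
    print(e)
```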
Can you expand upon this? Perhaps the difference between "re-mapping" the function:
f(x_1)=y_2
and "re-mapping" the value:
x=42+2
How is the former different from the latter? And by what mechanism is the former achieved? I understand what you are saying, but how does one simply "change this same function"? Redefine it?
To be clear, I'm not suggesting you are incorrect. I just don't fully understand what you are getting at.
I agree that data modeling is underestimated, but it can hardly be considered a solved problem. It is very hard because there are numerous alternative understandings and formal definitions of what we mean by data (RM, OO, OR, MD, etc.). In addition, there are several levels of representation (physical, logical, semantic). In real projects, they all are mixed.
The concept-oriented model tries to overcome some problems of RM by relying on two constructs: sets and functions. In contrast, RM uses only sets. The idea is that data can be stored in functions and transformed via operations with functions.
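As a rough illustration of "data stored in functions and transformed via operations with functions" (only a sketch in plain Python, not the model's actual formalism; dicts stand in for stored functions):

```python
# Two "functions" mapping order IDs to data (columns as functions).
price    = {101: 10.0, 102: 25.0, 103: 7.5}
quantity = {101: 3,    102: 1,    103: 4}

# A derived function defined by an operation on existing functions,
# rather than by a set-oriented join of tables.
def total(order_id):
    return price[order_id] * quantity[order_id]

print(total(101))  # 30.0
```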
Sure, but we don't typically do sophisticated machine learning on them. The vast majority of modern CPUs have vector instructions. Even a Raspberry Pi's ARM has NEON.
Excuse me if this sounds stupid, but vector instructions are assembly. I know we can use inline assembly or compile and link assembly alongside C, but isn't it the compiler that is in charge of using vector instructions?
IIRC GCC has -mmmx and -msse/-msse2/-msse3/-msse4 options to enable these kinds of instructions.
Sure, if the compiler can find optimizations by inserting vector instructions, it will. But, typically you'll want to specifically format your code using matrices/a library like BLAS to maximize performance and use as many vector instructions as possible.
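For example, in Python the practical way to reach those vector instructions is to hand the inner loop to a BLAS-backed library call rather than writing the loop yourself (a small NumPy sketch; how much SIMD you actually get depends on your BLAS build):

```python
import numpy as np

n = 1000
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# Plain Python loop: one scalar multiply-add per iteration,
# with interpreter overhead on every step.
slow = 0.0
for x, y in zip(a, b):
    slow += x * y

# np.dot dispatches to a BLAS routine (ddot here); BLAS builds are
# typically compiled to use the CPU's vector (SIMD) instructions.
fast = float(np.dot(a, b))

print(abs(slow - fast) < 1e-6)  # True: same result, very different speed
```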
"Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits."
There are many projects aimed at making Excel "a thing of the past" but they focus on different needs:
I think this is why Excel/Google Sheets still dominate and will continue to.
All of these products do one or a few things that Excel or Google Sheets do, but perhaps they make it a little easier for a novice. I think what people don't understand about Excel (and to a lesser extent Google Sheets) is that it's an IDE. A novice can build interesting things, a power user can build incredible things.
The only way to make Excel a thing of the past would be to make a blow-away awesome replacement that does everything excel does but better. There's plenty of blue ocean around Excel and I think each of the products you listed could do just fine.
I would love all of Excel's power available to me but delivered like Google Sheets. That would definitely kill Excel. So far neither Microsoft nor Google seem really committed to this. Google Sheets is nice but just grabs the low hanging spreadsheet fruit. Office 365 is anemic.
If I had the time and funding I'd love to make a true Excel killer that was a faithful recreation of ALL of Excel's capabilities but delivered in a modern way. I'd pay good money for this. I believe many would. Excel may be a dinosaur, but it's still the apex predator.
I can't speak for ryanmarsh, but I have a few thoughts.
* Excel really has two major ways of performing computations on data - formulas within the grid and actions upon the grid. Despite the utility of the formula based dataflow model, there are too many operations that have to be performed as one-shot operations via commands (or scripted via VBA). Having formula based approaches for sorting, dividing into bins, etc. would be very useful.
* It'd be nice if Excel cells could contain values other than scalars. (Arrays, tuples, lists, maps, matrices, complex numbers, etc.)
* VBA can be used to define custom functions, but there's a lot of marshalling overhead going to VBA and the programming model is slightly different. It'd be nice to fix both of those issues.
* There's no way to locally bind names within a cell formula, so often subexpressions have to be duplicated. (And I believe they're doubly evaluated too.)
Shameless plug: I'm a founder of Alphasheets, a company seeking to solve problems like these! I couldn't resist replying after seeing these comments.
We make a collaborative (Google Sheets style) spreadsheet with Python and R running in the sheet. You can define functions, plot using ggplot, embed pandas dataframes, numpy matrices and all that good stuff. We don't let people use macros; all the code runs in cells because we think macros are too brittle. You can check out the website at http://alphasheets.com .
We're seeing that many enterprises (for example, in finance) that have Excel power users are moving to Python because of limitations like these, and are running into adoption issues because people like spreadsheets so much. That's generally where we come in and provide a bridge from the Excel world to Python through a more friendly frontend.
We're also seeing that Alphasheets can help a lot with shortening feedback cycles on more sophisticated data analyses: Excel is the most popular self-serve analytics tool out there, but doesn't cover cases where you need Python/R/fresh data.
This is very nice. Problem is, there are sooo many more features in Excel you'll have to copy to get me to move. If you ask "which ones" I'll say "all of them". I'm a power user. I build huge dashboards and analytical tools in Excel. The thing I hate most is that all my work goes into a file that I have to pray works on the other person's computer.
The product is great. But you guys will need to launch a fully feature rich desktop client, which can sync with the cloud.
Else it's the same thing mentioned in the previous comments. You would build a web app with 5% of the features of Excel, and the moment somebody reaches a use case that can't be solved with your tool, they will have to switch to Excel. If they have to switch every second time they use your product, they might as well do all their work in Excel to begin with.
You have to be feature compliant with excel and you can't do that on a web app alone.
* see a modern replacement for VBA, dare I say using JS
* be able to share a document that won't break when someone opens it on their computer (even if it's using all the Excel bells and whistles, including external data sources and plugins). Google Sheets, by contrast, is just a link.
* be able to use all the amazing features via the web and/or an app
Let's call Google Sheets "modern" because it can be used from an app or any web browser. I can share a Google Sheet much easier than an Excel file using all the bells and whistles.
The problem is, Excel has a ton of very powerful features, many of which Google Sheets doesn't provide. Something like VBA would be nice. I'm aware you can write JS plugins for Google Sheets, but the experience is nowhere near as good. Pivot tables in Excel still smoke Google Sheets.
The witheve.com stuff (and the underlying "differential dataflow") is also interesting as a model for derived data which updates itself. I'm keeping an eye on that project too.
As far as your site goes (I just took a brief glance), if you haven't seen it already, you might find some interesting ideas in the sieuferd project:
I wonder if a kind of hybrid programming would be possible which switches between this dataflow-like functionality for parts and more traditional ('large blocks of text'-based) techniques for other parts.
I was working on a simple framework a while ago where the highest-level organizational structure was a 'domain' and these domains would connect to one another via 'converters'. I think the dataflow format would work really well for defining and linking up domains, and small functions that do things like filtering would work well within converters—but then maybe within particular domains it's somewhat of a free-for-all again (i.e. you use traditional programming techniques). Just thought I'd share the idea on the off chance that it sparks something for ya—I'm not really doing anything with that project at the moment.
I'm also curious why you prefer the tabular format over something graph-based. Is it just that it's more straightforward for people to lay things out/organize?
I'm not currently a user but https://www.smartsheet.com/ had pretty advanced features that I liked. The UI was a bit old-fashioned but the tool is capable.
From this (and many other) tutorial it is not clear if tensors in tensorflow are true mathematical tensors (that is, having covariant and contravariant indices) or they are multidimensional arrays. The name Tensorflow and terminology suggests that Tensorflow manipulates mathematical tensors, for example:
but what you see are multidimensional arrays. It is of course not a big problem but probably could be clarified somewhere at least in small font to avoid ambiguity. Or Tensorflow objects are true tensors indeed?
Even if they were (I doubt), I haven't found a clear and informed description how they relate exactly to tensors found in math/physics literature. I agree with your view that they look more like nd-arrays.
I was surprised when I first saw the word "tensor" being thrown around by computer scientists to apparently mean just multi-dimensional array. But then I thought, well, "vector" is very widely used - including by mathematicians - to mean simply an nx1 or 1xn array, rather than an object which transforms a certain way under coordinate transformations. So in the same way, I suppose we really might as well use "tensor" to mean "just" a multi-dimensional array of numbers, in contexts where coordinate transformations aren't important. Mathematical physics can continue to use the other definition where necessary, just as it does for vectors.
The trouble with that approach is that in CS, tensors are mostly used in machine learning, which is very math-dependent. So, you read in a textbook or a paper that something can be done elegantly by using some linear algebra operation, or some transformation on a tensor, and are delighted, because your library says to be tensor-based, but, then, when you try to code it, whoops; you meant you had tensor support, but all you've got is a multidimensional array memory layout...
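To make the distinction concrete: what the library stores is only the component array, and the tensor transformation law has to be applied by hand. A small NumPy sketch for a (1,1)-tensor under a change of basis (the matrices are arbitrary examples):

```python
import numpy as np

# Components of a (1,1)-tensor (a linear map) in some basis:
# just a plain 2-D array as far as the library is concerned.
T = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# A change of basis A (invertible). The library does not know or
# enforce the tensor transformation law; we apply it ourselves:
#   T'^i_j = A^i_k  T^k_l  (A^-1)^l_j
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
A_inv = np.linalg.inv(A)

T_prime = np.einsum('ik,kl,lj->ij', A, T, A_inv)

# Invariants of the underlying tensor (e.g. the trace) are preserved,
# even though the raw array entries change.
print(np.trace(T), np.trace(T_prime))  # 5.0 5.0
```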
The lifetime word limit is too rigid a constraint. There should be a way to acquire (or lose) points that you can later spend on publishing your results. Then the "price" of publishing negative results could be lower than that of other results.
HPAT looks pretty promising. I wonder how they managed to significantly increase the performance of shuffling and sorting, which are known to be quite difficult operations in map-reduce.
[1] Concept-oriented model: Modeling and processing data using functions https://www.researchgate.net/publication/337336089_Concept-o...