To be really automatic, the only thing I should need to do is feed in a CSV, correct the suggested data types, and then run all algorithms on the data, with a report at the end of which performed best and which I should optimize further.
Sure, but shotgunning every statistical test and machine learning algorithm completely undermines your results, because the power and significance levels are not adjusted for many experiments. In the statistical setting this leads to spurious correlations, and in the machine learning setting it leads to overfitting. In either case the results have a high risk of not generalizing beyond the initial sample being analyzed.
I'm not saying you're endorsing this, but it's basically antithetical to sound experimental design. I don't think the author should pursue automatic anything when it comes to statistics, unless it's just a thin quality-of-life wrapper around other statistical primitives and libraries.
There are, for example, multiple decision tree learners and rule learners. Each has different semantics and behaves differently on the data. Running every one and seeing which performs best is a completely normal approach.
And with k-fold cross validation it's very hard to overfit.
It's not that you intrinsically need a human, it's that doing this without human oversight requires being very careful not to make tricky mistakes.
The nature of statistical significance (which underpins everything you've said) is that repeating many experiments reduces the confidence you should have in your results. Supposing each algorithm is an experiment and each experiment is independent, if you target a significance level of p = 0.05, you can expect to find 1 spuriously "correlated" feature out of every 20 you test just by chance.
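To make that arithmetic concrete, here's a throwaway simulation (plain NumPy/SciPy; all the numbers are arbitrary) that tests 20 pure-noise features against an unrelated target, so every "significant" hit is a false positive:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_samples, n_features, alpha = 200, 20, 0.05

def count_spurious(rng):
    # Features and target are independent noise, so any feature that
    # tests "significant" at alpha is a false positive by construction.
    X = rng.normal(size=(n_samples, n_features))
    y = rng.normal(size=n_samples)
    p_values = [pearsonr(X[:, j], y)[1] for j in range(n_features)]
    return sum(p < alpha for p in p_values)

trials = 200
avg = np.mean([count_spurious(rng) for _ in range(trials)])
print(f"average spurious 'significant' features per run: {avg:.2f}")
# ≈ alpha * n_features = 1.0 in expectation
```

On average you get about one "discovery" per run despite there being nothing to discover, which is exactly the 1-in-20 figure above.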
Can you automatically correct for this? Sure. But this is just one possible footgun. Are you confident you're avoiding them all? In theory automation could do an even better job than a human of avoiding the myriad statistical mistakes you could make, but in practice that requires significant upfront effort and expertise during the development process.
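For what it's worth, the simplest automatic correction is Bonferroni: divide the per-test significance threshold by the number of tests. A sketch on the same kind of synthetic noise data (so, again, every hit is a false positive):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_samples, n_features, alpha = 200, 20, 0.05

X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
p_values = np.array([pearsonr(X[:, j], y)[1] for j in range(n_features)])

naive_hits = int((p_values < alpha).sum())
# Bonferroni correction: each test must clear alpha / n_features.
bonferroni_hits = int((p_values < alpha / n_features).sum())
print(f"uncorrected hits: {naive_hits}, Bonferroni-corrected hits: {bonferroni_hits}")
```

Bonferroni is conservative (it sacrifices power), which is itself one of the trade-offs an automated tool would have to expose rather than hide.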
At a certain point doing this automatically becomes analogous to rolling your own crypto. It's not quite an adversarial problem, but it's quite easy to screw up.
I agree that cross validating would work; that's what I was gesturing to when I was talking about making an assessment of the data and partitioning it. Either the provided sample should be partitioned for cross validation, or it should prompt the user for a second set.
Correct. My point is that both humans and machines face the same issues. At least with a machine you get consistent errors (which don't cost you time) and which you can reduce over time.
With humans, you must make sure that the same human with the same skill set, who knows statistics at a master's level, will always be there for your specific data and will actually have the time to run the experiments.
Also, I think that 95% of the potential users/consumers of machine learning are non-consumers, i.e. they do not have ANY access to machine learning tech, and thus have to resort to guessing.
So the ethical thing to do is actually to give them some tool, even if it might not be optimal.
No, it's emphatically not a great feature, and it's not clear to me the commenter was recommending that so much as making a nit. Please don't automate the process of choosing and running algorithms on a single sample of data, it's unsound experimental design that undermines your results. If you insist on doing it anyway, at minimum you will need to automate an initial assessment of the sample data to determine if it has a suitable size and distribution to allow you to adjust the significance of results for the number of tests you're running, and partition the data into smaller subsamples.
Hi, thanks for your comment. I actually understood him to mean something like hyperparameter search/tuning using cross validation (at least, that's what came to my mind).
Parameter tuning and algorithm selection! I just don't want to manually start 5 different runs of algorithms I believe could work well on the data and manually compare the results. And maybe I was too lazy to run the 6th algorithm, which would have performed much better.
But to be sure, every test should be done with k-fold cross validation. Whether to split the training set should not be left to the user; it must be mandatory.
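That kind of loop is easy to sketch with scikit-learn. The dataset and the candidate list below are placeholders, but the point stands: every candidate goes through the same mandatory 5-fold cross validation, with no opt-out:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Placeholder candidates; a real tool would enumerate its learner registry.
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=5000),
    "random forest": RandomForestClassifier(random_state=0),
}

# The 5-fold split is applied uniformly to every candidate,
# so the comparison is never done on a single train/test split.
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: mean CV accuracy {score:.3f}")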
Cross validation would be good! I think if you build this in you could automatically run a few heuristics to see if the data can be partitioned, or maybe just prompt the user for another sample of the data with the same distribution.
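One such heuristic could be as simple as checking whether every class would still have a handful of examples in each fold before allowing the partition; the function name and threshold here are purely hypothetical:

```python
import numpy as np

def can_partition(y, k=5, min_per_class_per_fold=5):
    """Hypothetical heuristic: allow k-fold CV only if every class
    would still have at least a few examples in each fold."""
    _, counts = np.unique(y, return_counts=True)
    return bool((counts // k >= min_per_class_per_fold).all())

# 60 vs. 8 examples: the rare class would have ~1 example per fold.
imbalanced = np.array([0] * 60 + [1] * 8)
balanced = np.array([0] * 100 + [1] * 100)
print(can_partition(imbalanced), can_partition(balanced))
```

If the check fails, the tool could fall back to prompting the user for a second sample, as you suggest.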