mikeskim's comments | Hacker News

I went from barely using data.table to using it for basically everything in a few years. I think that's the broader trend, given it's faster than basically everything else: https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A...


The way I use Python in machine learning is quite different from how many others in competitive ML use it. I use pure Python 2.7 with PyPy and try not to touch numpy, scipy, pandas, etc. R's data.table is possibly faster than Python's numpy/scipy/pandas, so I think anyone choosing Python because of numpy/scipy/pandas is being misled: you should be using Python in spite of needing numpy/scipy/pandas, not because of them. If you really need that kind of functionality, just use R and data.table, which is amazingly fast. I think Python is really great because of PyPy and the strength of the standard library.
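To make that concrete, here is a minimal sketch (my illustration, not the commenter's actual code) of the pure-Python style being described: logistic-regression SGD on plain lists with the standard library only, the kind of tight loop PyPy's JIT handles well without numpy.

    import math
    import random

    def sgd_logistic(rows, labels, lr=0.1, epochs=5):
        # rows: list of feature lists; labels: list of 0/1 ints.
        w = [0.0] * len(rows[0])
        b = 0.0
        order = list(range(len(rows)))
        for _ in range(epochs):
            random.shuffle(order)
            for i in order:
                z = b + sum(wj * xj for wj, xj in zip(w, rows[i]))
                z = max(min(z, 30.0), -30.0)   # clamp to avoid exp overflow
                g = 1.0 / (1.0 + math.exp(-z)) - labels[i]
                b -= lr * g
                for j, xj in enumerate(rows[i]):
                    w[j] -= lr * g * xj
        return w, b

Under CPython this inner loop is slow; under PyPy the JIT compiles it to something competitive, which is the point of the workflow described above.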


It is a fact that students at top tech schools (think Caltech, MIT, Harvey Mudd, etc.) study longer than students at comparable liberal arts schools. I believe it takes time and practice to get good at anything, and students in tech majors at tech schools simply spend more time practicing their craft. Obviously that doesn't mean an English major can't pick up math, computer science, and physics later in life, but such people are the outliers.


I wish academics would publish pure Python implementations of their "new" algorithms. Standard Python with PyPy is enough for both speed of development and runtime speed.

The biggest thing about t-SNE is that it's been used successfully in competitive machine learning for quite a long time by many different people, because it's available in R via CRAN and in Python via sklearn. LargeVis has potential, but it could also turn out not to be useful, like the vast majority of academic work.
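For reference, a minimal sketch of the sklearn route mentioned above; the dataset and parameter values are just illustrative, not recommendations.

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)
    # Project the 64-dimensional digit images down to 2-D for plotting.
    emb = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
    print(emb.shape)  # (1797, 2)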


After a paper like this comes out, usually someone adds it to their library or publishes a public implementation; often there is more than one. Maybe the authors aren't the best people to write a clean version of it.


Any version is better than none, even if it's just a basic/crude reference implementation that library authors can use to add the algorithm to their own libraries.


I would not be surprised if most academic CS research is not reproducible. This is true for many other fields outside of CS, and I've seen it firsthand in machine learning. It's a problem, but it's also just how things are.


I know people who work at those research divisions (they work with neural networks) without a PhD. So I just falsified your claim.


Well yes, barriers to entry do get lower when markets are hot. What qualifications do the people you know have?


This is incorrect in almost every way. When you have 2^m independent observations to cross-validate with (where m is very large), overfitting is exceptionally difficult almost regardless of the number of features. Overfitting typically occurs when the number of data points is small overall, small relative to the number of features, and when the observations are not i.i.d.
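As a rough illustration of that point (my own sketch, not the commenter's experiment): on i.i.d. data, a flexible model's train-vs-cross-validation gap shrinks as the number of rows grows, with the feature count held fixed.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    for n in (200, 20000):
        X = rng.normal(size=(n, 100))      # 100 features, held fixed
        y = (X[:, 0] > 0).astype(int)      # only one feature carries signal
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        cv = cross_val_score(clf, X, y, cv=5).mean()
        train = clf.fit(X, y).score(X, y)
        print(n, round(train - cv, 3))     # the gap shrinks as n grows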


I think he's talking about growth in the features (predictor variables) of your dataset while keeping the number of independent observations constant, not growth in the dataset due to new independent observations.

I think he's right to focus on that case - I find folks propose new features far more frequently than new observations become available.
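A quick sketch of that regime (again my illustration, complementary to the one above): hold n fixed and pile on noise features, and the same model's train-vs-CV gap widens.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 200
    for p in (5, 50, 500):
        X = rng.normal(size=(n, p))
        y = (X[:, 0] > 0).astype(int)      # only the first feature matters
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        cv = cross_val_score(clf, X, y, cv=5).mean()
        train = clf.fit(X, y).score(X, y)
        print(p, round(train - cv, 3))     # the gap grows with p at fixed n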


When people talk about big data, they usually mean datasets with lots of observations (i.e., rows), not datasets with lots of features but few rows; those are far more common in fields like genetics and the omics sciences generally.


This is supposed to be faster than XGBoost? I'm skeptical, but I'd like to know the specifics of the benchmarks, and maybe an outline of the code or the reasons why it would be faster. It was not benchmarked by the same person who did https://github.com/szilard/benchm-ml


Vowpal Wabbit is I/O-bound, meaning there's no way it is slower than anything else on a single machine. On multiple machines it glides faster than light.

So, the benchmark is probably incorrect for VW.


If people would just learn to resample (cross-validate, subsample, or use the bootstrap), we wouldn't be having this pointless discussion at all.
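A minimal sketch of both resampling ideas mentioned; the synthetic dataset and model are just placeholders for illustration.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.utils import resample

    X, y = make_classification(n_samples=500, random_state=0)
    model = LogisticRegression(max_iter=1000)

    # k-fold cross-validation
    print("10-fold CV accuracy:", cross_val_score(model, X, y, cv=10).mean())

    # bootstrap: refit on resampled rows, score on the held-out (out-of-bag) rows
    accs = []
    for seed in range(100):
        idx = resample(np.arange(len(y)), random_state=seed)
        oob = np.setdiff1d(np.arange(len(y)), idx)
        accs.append(model.fit(X[idx], y[idx]).score(X[oob], y[oob]))
    print("bootstrap OOB accuracy:", np.mean(accs))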


In theory you are correct, but some datasets are too small, or make it too difficult to cross-validate in a way that is meaningful to the original problem.


I'd like feedback on whether this algorithm is new or has been published before. I think it's new, but I've been wrong before on these types of questions. Thanks, Mike.

