
That's bog standard -- every company uses hadoop. Then when you see the actual datasets, they're maybe a couple hundred gigs completely denormalized. Yet you still have to use hadoop/hive/spark to access them, with all the inefficiencies, complexity, and slowness those bring.

One of the things Trey skipped -- he covered only the first two -- that is very annoying is the big split in the data science field: data scientist/analysis, data scientist/builder, and data engineer/ETL. Data scientists' work sits on top of a giant pile of data engineering, and companies often (imo intentionally) try to hire data scientists by dangling interesting analysis or implementation work, but when you dig deep enough, or worse, accept the offer, it's really 80%+ data engineering. (And they get pissy when you quit two months in after discovering this, both because that's not what you signed up for and because relationships founded on lies tend not to work out well for employees.)

The other very difficult thing you run into is project tests; it's hard to test something deeply in 5 hours. Even when companies claim to want to test statistics knowledge, the tests almost always turn out to be dominated by data ingestion/cleaning work. Or they're simply too much work. E.g. Stitchfix wanted me to spend 10+ hours implementing an analysis after I had only spoken to a recruiter, without even having talked to one of their data scientists because they were "too busy". The recruiter was grumpy when I stopped responding to email.



> Then when you see the actual datasets, they're maybe a couple hundred gigs completely denormalized. Yet you still have to use hadoop/hive/spark to access them, with all the inefficiencies, complexity, and slowness those bring.

I was always under the impression that one of the benefits of NoSQL was its speed, but watching a webcast the other day where a very small dataset was being queried, I was shocked at how slow it was -- and this was in contrast to another demo where a different query was mind-bogglingly fast compared to comparable performance on a traditional SQL platform. (Yes, I know the particulars matter here and it's not that good a question without that specificity, but any light you could shine on this would be appreciated.)

For data of "a couple hundred gigs", what platform would you say is more appropriate?


No, the benefit of nosql, at least for data science, is scalability -- i.e., what do you do when you can't fit the data on a single machine? That worked great at a former employer, which really did have PB-scale datasets. The vast, vast majority of companies do not have PB-scale datasets. Most don't even have TB-scale datasets.

As for what to do: postgres/mysql, pandas/R, or roll your own code, depending on precisely what you need. You can rack a pretty beefy box with 256G of RAM, 2 Xeons, and a ton of SSD + spindle disk for $10k. There's nothing nosql or hadoop or spark can do that can't be done easier, written way faster, executed faster, and kept running more easily on a single box -- or even better, in a single process.
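
To make that concrete, here's a rough sketch of the single-process route in pandas -- the file name, column names, and chunk size below are made up for illustration, not anything specific from the thread:

    # Stream a big raw CSV in chunks so it never has to fit in RAM at once,
    # aggregating as we go; no cluster, no hadoop/spark, one process.
    import pandas as pd

    totals = None
    for chunk in pd.read_csv("events.csv", usecols=["user_id", "bytes"],
                             chunksize=5_000_000):
        part = chunk.groupby("user_id")["bytes"].sum()
        totals = part if totals is None else totals.add(part, fill_value=0)

    print(totals.sort_values(ascending=False).head(20))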

For example: at my current gig I work on 20-40G raw datasets. Ingesting to pandas and externalizing the user agent strings drops that to 5G or so. That process takes 30 to 60 minutes, but I do it once, cache the results, and update incrementally.
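
In pandas terms, a minimal sketch of what "externalize the user agent strings" means -- file and column names here are just placeholders, not the actual pipeline:

    import pandas as pd

    df = pd.read_csv("raw_logs.csv")              # the big raw pull

    # Replace the repeated user-agent strings with small integer codes and
    # keep a single lookup table of the unique strings on the side.
    codes, uniques = pd.factorize(df["user_agent"])
    df["ua_code"] = codes
    df = df.drop(columns=["user_agent"])
    pd.DataFrame({"user_agent": uniques}).to_parquet("ua_lookup.parquet")

    # Cache the slimmed-down frame; later runs load this and only process
    # the new data incrementally instead of redoing the whole ingest.
    df.to_parquet("logs_cached.parquet")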


Postgres, or depending on the particulars just start rolling your own.



