
> We changed the structure of our heaviest used models a couple times in the past year, and instead of going back and updating millions of old documents, we simply added a “version” field to the document and the application handled the logic of reading both the old and new version. This flexibility was useful for both application developers and operations engineers.

Ugh, this sounds like a maintenance nightmare. How do you deal with adding an extra field to the document? Do you ever feel the need to run on-the-fly migrations of old versions? (But when you do, wouldn't running a migration for all documents be a better idea?)

I'll admit I'm a non-believer, but every time I see "Schemaless" in MongoDB, I think "oh, so you're implementing schema in your application?"



> Ugh, this sounds like a maintenance nightmare. How do you deal with adding an extra field to the document? Do you ever feel the need to run on-the-fly migrations of old versions? (But when you do, wouldn't running a migration for all documents be a better idea?)

Yes, we did on-the-fly migration as we loaded old data in.

Doing a full data migration was not really an option: querying MongoDB on unindexed data is so slow, and paging in all that data would purge hot data, exacerbating the problem.
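
For anyone curious what that looks like in code, here is a minimal sketch of the pattern (assuming Python with pymongo; the database, collection and field names are made up): documents carry a "version" field, and the app upgrades a document the first time it reads it so the work never has to be repeated.

  from pymongo import MongoClient

  CURRENT_VERSION = 2
  users = MongoClient().myapp.users  # hypothetical database/collection names

  def upgrade(doc):
      # Hypothetical model change: v1 kept a flat address string, v2 an embedded doc.
      if doc.get("version", 1) == 1:
          doc["address"] = {"raw": doc.get("address", "")}
          doc["version"] = 2
      return doc

  def load_user(login):
      doc = users.find_one({"login": login})
      if doc is not None and doc.get("version", 1) < CURRENT_VERSION:
          doc = upgrade(doc)                            # migrate on read...
          users.replace_one({"_id": doc["_id"]}, doc)   # ...and persist the new shape
      return doc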

> I'll admit I'm a non-believer, but every time I see "Schemaless" in MongoDB, I think "oh, so you're implementing schema in your application?"

That's exactly what happens.


So it's actually 100% the same as what you would do with on-the-fly migrations in SQL (a sketch follows the steps):

(1) Add the column, and add code that migrates the old data when you access it. Deploy.

(2) Let it run for a while. Run a background job that migrates the rest (this might be done days or months later).

(3) Remove the old column and the custom migration code.
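
A minimal sketch of those three steps, assuming Python with psycopg2, a hypothetical users table, and a "name" column being split into first_name/last_name:

  import psycopg2

  conn = psycopg2.connect("dbname=app")  # hypothetical connection details

  # (1) Adding nullable columns with no default is cheap; deploy read-path
  #     code that migrates a row the first time it is touched.
  with conn, conn.cursor() as cur:
      cur.execute("ALTER TABLE users ADD COLUMN first_name text")
      cur.execute("ALTER TABLE users ADD COLUMN last_name text")

  def get_user_name(user_id):
      with conn, conn.cursor() as cur:
          cur.execute("SELECT name, first_name, last_name FROM users WHERE id = %s",
                      (user_id,))
          name, first, last = cur.fetchone()
          if first is None:                       # old-shape row: migrate on access
              first, _, last = name.partition(" ")
              cur.execute("UPDATE users SET first_name = %s, last_name = %s "
                          "WHERE id = %s", (first, last, user_id))
          return first, last

  # (2) Later, a background job sweeps up whatever rows were never read.
  # (3) Finally: ALTER TABLE users DROP COLUMN name, and delete the fallback code.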

The more I hear about "schemaless" the more I realize that it doesn't make any difference at all.


Absolutely not!!

It's more like this

You have an old user table with, for example, login and user name.

In MongoDB this is a JSON object {login:'user', name:'User Name'}

You want to add 'shoe size'. So you add

1 - the shoe size to the signup/user editing form

2 - next user created is like this: {login:'user', name:'User Name', 'shoe_size': 7}

3 - Old users DON'T get a shoe_size added automatically to their document, but the next time they log in they get asked "What's your shoe size?". It's dynamic: if (!user.shoe_size), done.

You add new columns on demand, and you almost never remove them (you don't need to add fields just for the sake of migrating, only the ones you actually want).

You almost never have to "run a migration" that goes through existing documents adding shoe_size.
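
In code, the "ask on next login" check is about this much work (a sketch assuming Python with pymongo; ask_for_shoe_size stands in for whatever UI prompt you have):

  from pymongo import MongoClient

  users = MongoClient().myapp.users  # hypothetical collection

  def on_login(login):
      user = users.find_one({"login": login})
      if user.get("shoe_size") is None:      # old document: the field just isn't there
          size = ask_for_shoe_size()         # hypothetical "What's your shoe size?" prompt
          users.update_one({"_id": user["_id"]}, {"$set": {"shoe_size": size}})
          user["shoe_size"] = size
      return user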

And it absolutely does make a huge difference, since you don't have to deal with the mess of relational DB migrations.

I'll never develop a system using a relational DB again if I can avoid it.


How is that different than adding a nullable column shoe_size to the user table in your relational DB, and an if(user.shoe_size != null) in your application?


That's still a migration, it's just incremental.


Incremental can make all the difference between zero- and hours of downtime. Do not underestimate the importance of this extra agility in the modern world of almost daily updates to web apps and backends.


I don't see where I underestimated that?


It's the same concept, but it's very different in practice. For example, adding a field with a default value in postgres will lock the table during the migration, which may be killer. If you use postgres for big data sets, what you'll end up implementing for your migrations looks a lot like what schemaless gives you for free.


If you add a default value it locks the table and re-writes it.

However, if you don't add a default value postgres will perform the operation instantly. Old rows will be untouched, and new rows will get a shoe_size of either NULL or whatever you set it to be... i.e. exactly the same outcome and performance as the MongoDB case mentioned above.

Adding a field in postgres while setting a default value would be exactly the same as adding a field in MongoDB and updating every existing row with that default value (except for the fact that postgres will only lock that one table, while MongoDB will lock your entire database).
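
To make the two variants concrete, a sketch assuming Python with psycopg2 and a hypothetical users table (the locking behavior in the comments is as the Postgres versions cited in this thread document it):

  import psycopg2

  conn = psycopg2.connect("dbname=app")
  conn.autocommit = True
  cur = conn.cursor()

  # Instant: a catalog-only change; old rows are untouched and read back as NULL.
  cur.execute("ALTER TABLE users ADD COLUMN shoe_size integer")

  # Also cheap: a default set afterwards only applies to rows inserted from now on.
  cur.execute("ALTER TABLE users ALTER COLUMN shoe_size SET DEFAULT 7")

  # The expensive variant described above: adding the column WITH a default
  # locks the table and rewrites every row on the Postgres versions discussed here.
  # cur.execute("ALTER TABLE users ADD COLUMN hat_size integer DEFAULT 7")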

Now I'm not anti-MongoDB, I'm just saying you shouldn't give it credit for something that relational databases do just fine.


Thank you very much for this quick tip. I've unfortunately only used MySQL at scale, and it most definitely grinds away for hours when ALTERing, default value or not.

It looks like MySQL might be giving everyone in the RDBMS world a bad rap.


> Now I'm not anti-MongoDB, I'm just saying you shouldn't give it credit for something that relational databases do just fine.

No, it can't. Sorry, it absolutely can't.

> If you add a default value it locks the table and re-writes it.

> However, if you don't add a default value postgres will perform the operation instantly.

And in MongoDB you don't have to change anything in the DB (hence, no downtime). It's all in your code.

Yes, I can add a field in PGSQL, and then my ORM quits on me (because I can't have an optional field, of course). Or I can use hand-written SQL, at which point it adds a couple of months to my schedule.

Or I can use migrations + ORM like Ruby migrations or South (Django). They will blow up for anything slightly more complex.

I also can't use the old version of the DB, so my old backups just became more troublesome to recover.

And that's why Facebook and others don't use relational anymore. Sure, you can use MySQL or PG, but the real data is stored in a text or binary blob.


> I also can't use the old version of the DB, so my old backups just became more troublesome to recover.

Tangential to the topic, but this is important: every schema update MUST be considered part of the app / database / backend / system source code, and as such it should be properly and formally applied, versioned and backed up. In the simplest case, you can store a version number in the database itself, every update applies to a certain version # X and advances to another version # Y, and every update is stored in its own source file (updateX.sql for example).

It would be nice if DBMSs offered primitives specifically for schema version management, but I guess there could be many different use cases, and it's very easy to roll your own.
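
A roll-your-own sketch along those lines, assuming Python with psycopg2, a schema_version table, and update1.sql, update2.sql, ... files kept in source control:

  import glob
  import re
  import psycopg2

  conn = psycopg2.connect("dbname=app")  # hypothetical connection details

  with conn, conn.cursor() as cur:
      cur.execute("CREATE TABLE IF NOT EXISTS schema_version (version integer NOT NULL)")
      cur.execute("SELECT coalesce(max(version), 0) FROM schema_version")
      (current,) = cur.fetchone()

  # Apply each pending updateN.sql in order, recording the new version number
  # in the same transaction as the schema change itself.
  updates = sorted(glob.glob("update*.sql"),
                   key=lambda p: int(re.search(r"\d+", p).group()))
  for path in updates:
      version = int(re.search(r"\d+", path).group())
      if version <= current:
          continue
      with conn, conn.cursor() as cur:
          cur.execute(open(path).read())
          cur.execute("INSERT INTO schema_version VALUES (%s)", (version,))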


> And that's why Facebook and others don't use relational anymore. Sure, you can use MySQL or PG, but the real data is stored in a text or binary blob

Is that why? I was under the impression that Facebook doesn't use the relational properties of MySQL because they couldn't get the performance out of it at their scale, a problem a lot of advocates of the non-relational model don't seem to have in my experience.


> Adding a field in postgres while setting a default value would be exactly the same as adding a field

Not true. Postgres locks the entire table for the duration of the update (which could be hours or days for a large enough dataset). Mongo will lock your entire database for tiny fractions of a second, lots and lots of times. The difference is huge. The postgres implementation takes down your site; the Mongo implementation doesn't.

At scale, people start using Postgres a lot like a NoSQL service. Check out the Reddit schema for example, or read IMVU's blog posts on the topic (with MySQL, but same point). Or Facebook's architecture. All those migration strategies look a lot more like Mongo than Postgres, even though they all use SQL DBs.

> Now I'm not anti-MongoDB, I'm just saying you shouldn't give it credit for something that relational databases do just fine.

Saying "I can do NoSQL migrations just fine in Postgres" is like saying "I can do OO just fine in assembly".


According to the Postgres documentation, the behavior the poster a few levels up described is possible in Postgres with no downtime.

> When a column is added with ADD COLUMN, all existing rows in the table are initialized with the column's default value (NULL if no DEFAULT clause is specified).

> Adding a column with a non-null default or changing the type of an existing column will require the entire table to be rewritten. This might take a significant amount of time for a large table; and it will temporarily require double the disk space.

http://www.postgresql.org/docs/8.3/static/sql-altertable.htm...


This is easily solved in Postgres. You don't do ALTERs with defaults or non-null values. Alters become fast [enough] and bearable, even at scale.

http://justcramer.com/2011/11/10/scaling-schema-changes/
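
The usual companion trick (not necessarily exactly what the linked post describes, but in the same spirit) is to backfill the new column in small batches so no single UPDATE holds locks for long; a sketch assuming Python with psycopg2:

  import time
  import psycopg2

  conn = psycopg2.connect("dbname=app")
  DEFAULT_SIZE = 7  # hypothetical default being backfilled

  while True:
      with conn, conn.cursor() as cur:
          cur.execute(
              """UPDATE users SET shoe_size = %s
                 WHERE id IN (SELECT id FROM users
                              WHERE shoe_size IS NULL LIMIT 1000)""",
              (DEFAULT_SIZE,))
          if cur.rowcount == 0:      # nothing left to backfill
              break
      time.sleep(0.1)                # let concurrent traffic through between batches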


For a large table, adding a new column can be non-trivial, since each record is updated when the new column is added.


If you choose to update all of the old records.

As other posts have already mentioned, adding a new column to a table in postgres happens instantly unless you set a default (telling postgres that you want it to update all the old records too).

If you add the default separately from adding the column, you get the default but don't rewrite all the old rows (which remain NULL).


That process sounds more like "implementing schemaless in SQL" than "implementing schema in NoSQL".


> I'll admit I'm a non-believer, but every time I see "Schemaless" in MongoDB, I think "oh, so you're implementing schema in your application?"

I think that is arguably one of the selling points of MongoDB. Yes, you do implement schema in your application, but you should be doing that knowingly, embracing both the costs and benefits.

The benefit is that you can very quickly change your "schema" since it's just however you're choosing to represent data in memory (through your objects or whatever). It also allows you to have part of your application running on one "version" of the schema while another part of the application catches up.

The tradeoff is that you have to manage all this yourself. MongoDB does not know about your schema, nor does it want to, nor should it. It affords you a lot of power, but then you have to use it responsibly and understand what safety nets are not present (which you may be used to from a traditional RDBMS).

To your point about migrations, there are a few different strategies. You can do a "big bang" migration where you change all documents at once. (I would argue that if you require all documents to be consistent at all times, you should not be using MongoDB). A more "Mongo" approach is to migrate data as you need to; e.g. you pull in an older document and at that time add any missing fields.

So yes, in MongoDB, the schema effectively lives in your application. But that is by design and to use MongoDB effectively that's something you have to embrace.
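
To make the two strategies concrete, a sketch assuming Python with pymongo and a made-up orders collection gaining a currency field:

  from pymongo import MongoClient

  orders = MongoClient().myapp.orders  # hypothetical collection

  # "Big bang": rewrite every old document in one pass.
  orders.update_many({"currency": {"$exists": False}},
                     {"$set": {"currency": "USD"}})

  # The lazier, more "Mongo" alternative: leave old documents alone and
  # patch each one the next time the application happens to load it.
  def load_order(order_id):
      doc = orders.find_one({"_id": order_id})
      if doc is not None and "currency" not in doc:
          doc["currency"] = "USD"
          orders.update_one({"_id": order_id}, {"$set": {"currency": "USD"}})
      return doc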


> A more "Mongo" approach is to migrate data as you need to; e.g. you pull in an older document and at that time add any missing fields.

I think this works when you're talking about adding fields. But it really breaks down if you need to make some change regarding "sub-objects" or arrays of "sub-objects".

If you have made a modeling mistake and you need to pull out a sub-object, you generally have to simply stop the system and migrate.


That's very true. Sometimes you can't do a live migration and you have to bite the bullet.


Which means it is not all that unlike a quality RDBMS like PostgreSQL. For adding fields (remember not to set a default value for large tables) and removing fields in PostgreSQL, there is no need to lock the tables for more than an instant. But for complicated schema changes you may have to either force a long lock or use some messy plan involving either replication or doing the migration in multiple steps.


> oh, so you're implementing schema in your application?

Isn't that where the schema belongs? Each document represents a conceptual whole. It doesn't contain fields which have to be NULL simply because they weren't in previous versions of the schema.

I've been an RDBMS guy (data warehousing/ETL) for a long time now, and I've seen a lot of large databases which have been in production for a considerable time. They get messy. Really messy. They become basically unmaintainable. Apples, oranges and pears all squashed into a schema the shape of a banana.

It's a pretty elegant solution, and is the problem XML/XSD were designed to solve.

The cleanest solution that I've seen in production used a relational database as a blob storage for XML-serialized entities. Each table defined a basic interface for the models, but each model was free to use its own general schema. After 10 years it contained a set of very clean individual entities which were conceptually correct.

As opposed to XML's use as a serialization format for remoting, which has largely been replaced with JSON.


> isn't that where the schema belongs? [In the application]

Well, unless you have, you know, multiple applications accessing said data. Then it's kind of important to keep it in sync, which is why RDBMS exist and operate the way they do.

In my experience, on a long enough timeline, the probability of needing multi-application access for your data goes to 1.


build an api


It's a shame this comment is pithy, because I think it's dead-on.

There are times to integrate at the database level. But the default should be single-application databases.

The rationale is the same as the grandparent's rationale FOR database integration. The odds of needing to share data over time are 1.

Given that shared belief, the problem with database integration is that MANY applications need to share facets of the same data. The single database ends up having a huge surface area trying to satisfy every application's needs.

The resulting schema will have definitions relevant for applications A, B, and C but not for X, Y and Z.

But, worse, there are dependencies between each application's working schema. This means ensuring integrity becomes harder with every application that integrates.

Finally, integration points are the hardest to change after-the-fact. The more services that integrate, the less ability to make the inevitable model changes necessary to fix mistakes/scale/normalize/denormalize/change solutions.

Thus, "build an api" is the best solution. Well-defined APIs and data-flows between applications helps data and process locality and avoids most of the problems I just listed. The trade-off is you're now meta-programming at the process level— the complexity doesn't disappear, it's just reconceptualised.


> The single database ends up having a huge surface area trying to satisfy every application's needs.

This is, more or less, exactly what views are for.

> Thus, "build an api" is the best solution.

And you can do that within the database with stored procedures, perhaps even with the same language you would use in the front-end (depending). And look at the advantages you have:

- No implied N+1 issues because your API is too granular

- No overfetching because your API is too coarse

- No additional service layers needed

- All the information is in the right place to ensure data validity and performance

Let me be clear: I see these as two viable alternatives and different situations are going to determine the appropriate tool. I bring this up because I do think the NoSQL crowd overall has a very distorted and limited picture of what exactly it is RDBMSes provide and why. If people look underneath their ORMs, they may find an extremely powerful, capable and mature system under there that can solve lots of problems well—possibly (but I admit, not necessarily) even _their own_ problems.
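
For a flavor of what the in-database version looks like, a sketch (Python with psycopg2; the function and tables are made up): the stored procedure is the API surface, and clients call it instead of touching the table.

  import psycopg2

  conn = psycopg2.connect("dbname=app")

  with conn, conn.cursor() as cur:
      cur.execute("""
          CREATE OR REPLACE FUNCTION set_shoe_size(p_login text, p_size integer)
          RETURNS void LANGUAGE plpgsql AS $$
          BEGIN
              UPDATE users SET shoe_size = p_size WHERE login = p_login;
          END;
          $$
      """)

  # Any client, in any language, goes through the function, not the schema.
  with conn, conn.cursor() as cur:
      cur.execute("SELECT set_shoe_size(%s, %s)", ("user", 7))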


> - All the information is in the right place to ensure data validity and performance

This is where we part ways.

We're talking specifically about integration. That means each system has different processes and is talking with other people.

If this is a case of three apps exposing the same process over three different channels (HTTP, UDP, morse code); then, database-level integration makes perfect sense.

But, as soon as differing behaviors come in, the database level doesn't, by definition, have enough information to ensure validity. One app thinks columns X and Y are dependent in one way, the other app views it another way. Now one or both of those apps are screwed for validity. And this problem grows with each additional app.

I am certainly not arguing against good databases. Stored procedures, views, etc. are all great even for a single application. But, I am arguing database level integration should be the rare exception to the rule.


> Thus, "build an api" is the best solution.

I think there's an asymmetry in your argument.

You are describing all of the problems with data management as though they were specific to schemas in a traditional RDBMS; but glossing over how "building an API" solves those same problems, and whether that method is better or worse.

In other words, "build an API" is the problem at hand, not the solution. A traditional DBMS provides a data definition language (DDL) to help solve the problem (schema design), which isn't a solution either, but it does offer a lot of direction. Saying "build an API" really gives me no direction at all.


Yes, I agree, "build an API" is the problem.

I specifically said the problem and the complexity inherent in solving it doesn't disappear with per-application databases.

But, the application (the 'A' part) is where the most context around the information is to be found. Tying multiple applications together at the Application-side (rather than the Data-side) means you don't lose that context... and it eliminates your chances of clashing (and the ensuing integrity issues).


That's a good point about the context, but not very concrete.

At some point you'll need to go into the details, but I don't think a comment is the right format. Maybe a blog post to show exactly what you mean.


Thanks for expanding on my answer, nailed it.


A schema is essentially an API over the data.

For instance, if you have an API with the functions:

  item# = add_item(description, price, color)
  record_sale(item#, quantity, customer#)
So let's see what happens when you need to make a change. For instance, you might decide that an item could be discounted (a new feature); so you need to record the price at the time of sale:

  record_sale(item#, quantity, customer#, price=NULL)
and if the price is NULL, you default to the current item price.

But you already have sales data without a price, so you need to migrate it by copying the current item prices over.

And if one of three applications doesn't support the new API, you need to offer the old API as well and it would just use the default all of the time.

That sounds pretty much like a schema to me. What's the difference?
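
Agreed; sketching the same evolution in application code (Python, with in-memory stand-ins for the stored data, purely for illustration) shows how closely the moves mirror a schema migration: an optional parameter, a fallback default, and a one-off backfill.

  from typing import Optional

  # In-memory stand-ins for the stored data, purely for illustration.
  items = {1: {"description": "shoe", "price": 49.0, "color": "red"}}
  sales = [{"item": 1, "qty": 2, "customer": 7, "price": None}]  # pre-change sale

  def record_sale(item_id: int, quantity: int, customer_id: int,
                  price: Optional[float] = None) -> None:
      # The new optional argument behaves like a nullable column: old callers
      # keep working, and the fallback is the current item price.
      if price is None:
          price = items[item_id]["price"]
      sales.append({"item": item_id, "qty": quantity,
                    "customer": customer_id, "price": price})

  def backfill_missing_prices() -> None:
      # The one-off "migration": copy current item prices onto sales recorded
      # before prices were captured at sale time.
      for sale in sales:
          if sale["price"] is None:
              sale["price"] = items[sale["item"]]["price"]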


So you want to put an API in place to access your schema-less database? So that multiple applications can access your database in a consistent manner? That makes sense.

You know what though... you may want to use multiple programming languages, and maintaining an API in multiple languages sucks. So, why not just make a service that each application can connect to, and then let it talk to your database. Then you just need to define a simple protocol that each language can use to connect to your service.

Or, you could just skip a few steps and use an actual database to begin with.


Building an API is still a great choice no matter how good your database is. Why would you not have an API if you are using the data store from 5 different languages?


Agree with this 100%. In the past the way different applications integrate with each other was through sharing the same database tables. Hence the need to keep the schema synchronized across multiple applications. Nowadays applications should integrate through well-defined APIs and service interfaces. See this Martin Fowler short essay on database styles: http://martinfowler.com/bliki/DatabaseStyles.html


> Hence the need to keep the schema synchronized across multiple applications

Views are designed to avoid that problem.

What is the specific way you handle versioning in an API that is superior to versioning in the schema?


I am a schemaless believer.

BUT (big BUT) have you ever tried to program against a schema that has an XML blob? It's really difficult and painfully slow on the app side. The ops guys like it, because it's easy for them to maintain.


I find it much easier, but then I rely on a framework for serialization/deserialization.


> They get messy. Really messy. They become basically unmaintainable. Apples, oranges and pears all squashed into a schema the shape of a banana.

Moving it to your application doesn't solve this problem, it just pushes it out of the DB.


Exactly! That's the best place to handle it. A nice API can make sense of the operations on the data and help it evolve.

These discussions always create in my mind a picture of someone saying that we should refill all the forms in a cabinet full of hand-filled forms because someone decided that a new version of the form requires a new field, like an e-mail address.


There are many serialization formats (e.g. Apache Thrift) with versionable schemas. You can do a poor man's Thrift with json, mongo, etc.

It's a common thing to throw thrift or protobuf values in giant key-value stores (cassandra, replicated memcache, hbase). You don't need to migrate unless you change the key format. If you want to do on-the-fly migrations (mostly to save space, and to be able to delete ancient code paths), you can do them a row at a time. And yes, we do occasionally write a script to loop through the database and migrate all the documents.
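
A poor man's version of that with JSON might look like this (a sketch in Python; the dict stands in for whatever key-value store you use, and the version change is made up): writers stamp a version, readers accept every version they know about.

  import json

  store = {}  # stand-in for the key-value store (cassandra, memcache, hbase, ...)

  def put_order(key, order):
      order["_v"] = 2                     # v2 stores the amount as integer cents
      store[key] = json.dumps(order).encode()

  def get_order(key):
      order = json.loads(store[key])
      if order.get("_v", 1) == 1:         # v1 rows stored a float dollar amount
          order["amount_cents"] = int(round(order.pop("amount") * 100))
          order["_v"] = 2
      return order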


> I'll admit I'm a non-believer, but every time I see "Schemaless" in MongoDB, I think "oh, so you're implementing schema in your application?"

I saw what may well have been 'schemaless' in an RDBMS recently, and the application code for it was far from pretty. I couldn't migrate at all; the results were far too inconsistent to pull it off reliably (you know something is wrong when boolean values are replaced with 'Y' and 'N', which of course both evaluate to 'true').

That being said, I tried to implement something else with Node.js and MongoDB, and I found it quite manageable. As long as the application implements a schema well, you should still be able to infer it when looking at the database direct.

To that extent, I'd take that over using an RDBMS as a key/value store for serialised data, because that's typically useless without the application that parses it.



