Hacker News
Show HN: TerminusHub, Distributed Revision Control for Structured Data (terminusdb.com)
81 points by LukeEF on Sept 8, 2020 | 19 comments


Hi! It seems some graphic content is served via 'http' instead of 'https', raising an exclamation mark in browsers: https://terminusdb.com/documentation/

Also, a small typo in 'Quickstart Ac(c)ademy'.

Thanks, and looking forward to checking TerminusHub!


Yes - we noticed that about the hardcoded http graphics! Need to fix. Thanks for that and the typo spotting.


Fixed! Thanks


Core team here! Amazing to launch after years of work.

Computers are fantastic things because they allow you to leverage much more evidence when making decisions than would otherwise be possible. It is possible to write computer programs that automate the ingestion and analysis of unimaginably large quantities of data.

If the data is well chosen, it is almost always the case that computational analysis reveals new and surprising insights simply because it incorporates more evidence than could possibly be captured by a human brain.

And because the universe is chaotic and there are combinatorial explosions of possibilities all over the place, evidence is always better than intuition when seeking insight.

As anybody who has grappled with computers and large quantities of data will know, it’s not as simple as that. The joy of analysis and insight is often crushed beneath a mountain of tedious data sourcing, preparation, management and cleaning tasks - ugly ETL scripts that are a horror to maintain and the double horror of trying to extract data with unknown character encodings from undocumented legacy systems - CP-1252 and its friends.

It shouldn’t be like this; it doesn’t have to be like this. Computers should be able to do most of this for us. It makes no sense that we are still writing the same simple and tedious data validation and transformation programs over and over ad infinitum. There must be a better way.

This is the problem that we set out to solve with TerminusDB. We identified two indispensable characteristics that were sorely lacking in existing data management tools.

The first one was a rich and universally machine-interpretable modeling language. If we want computers to be able to transform data between different representations automatically, they need to be able to describe their data models to one another.

The second was effective revision control. Revision control technologies have been instrumental in turning software production from a craft into an engineering discipline because they make collaboration and coordination among large groups much more fault tolerant - and boy, do humans produce faults. The need for such capabilities is screamingly obvious when dealing with data, where multiple versions of the same underlying dataset are almost ubiquitous, yet tool support is at best primitive.

In October 2019, we released version 1.0 of TerminusDB. It contained the culmination of four years of building out the data modeling capacity that we needed: the W3C's Web Ontology Language (OWL) with a closed-world interpretation.

We chose OWL because it is by far the best thing humanity has yet produced in terms of a rich, machine-interpretable data modeling interchange format. It is essentially first-order logic with set operations - when it comes to platform interoperability, nothing beats mathematics! And adding a closed-world interpretation to OWL turns out to be surprisingly easy and semantically unproblematic (closed worlds are contained within open worlds).
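To make the closed-world point concrete, here is a toy sketch in Python. The predicate names, triple format, and the "Person has exactly one name" constraint are all invented for illustration - this is not OWL syntax or the TerminusDB API, just the difference in interpretation:

```python
# Toy illustration of closed- vs open-world interpretation of a
# cardinality constraint over a set of triples.

triples = {
    ("joe", "rdf:type", "Person"),
    ("jane", "rdf:type", "Person"),
    ("jane", "name", "Jane"),
}

def names_of(subject, facts):
    return [o for s, p, o in facts if s == subject and p == "name"]

def check_closed_world(facts):
    """Closed world: a Person with no recorded name violates a
    'Person has exactly one name' constraint, because absence of
    a triple means the fact does not hold."""
    errors = []
    for s, p, o in facts:
        if p == "rdf:type" and o == "Person":
            if len(names_of(s, facts)) != 1:
                errors.append(s)
    return errors

# Under an open-world reading, joe's missing name is not an error --
# the name may simply be unknown. Under a closed-world reading it is:
print(check_closed_world(triples))  # flags "joe"
```

For schema validation you want the closed-world behaviour: missing data is a problem to report, not an unknown to tolerate.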

In January 2020, with version 1.1, we released the first version of our immutable revision control storage layer - with many of the ideas shamelessly borrowed from git, but expanded significantly, because when dealing with data you need to distinguish between things like schema and instance data and keep them aligned. It turns out to require a significantly more complex structure of internal pointers, but it can be done!
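The shape of that idea - a commit pinning aligned versions of a schema graph and an instance graph, each stored as an immutable chain of delta layers - can be sketched roughly as follows. All class and field names here are invented for illustration; the real storage layout is considerably more involved:

```python
# Hedged sketch: immutable delta layers, and commits that version
# schema and instance graphs together so they stay aligned.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Layer:
    added: frozenset                  # triples added vs. the parent layer
    removed: frozenset                # triples removed vs. the parent layer
    parent: Optional["Layer"] = None

    def materialize(self) -> frozenset:
        base = self.parent.materialize() if self.parent else frozenset()
        return (base - self.removed) | self.added

@dataclass(frozen=True)
class Commit:
    schema: Layer                     # one commit pins one version of
    instance: Layer                   # the schema AND the instance data
    parent: Optional["Commit"] = None

base = Commit(
    schema=Layer(frozenset({("Person", "field", "name")}), frozenset()),
    instance=Layer(frozenset({("joe", "name", "Joe")}), frozenset()),
)
# A later commit renames the schema field and rewrites the instance
# data in the same step, keeping the two graphs consistent.
next_commit = Commit(
    schema=Layer(frozenset({("Person", "field", "full_name")}),
                 frozenset({("Person", "field", "name")}),
                 parent=base.schema),
    instance=Layer(frozenset({("joe", "full_name", "Joe")}),
                   frozenset({("joe", "name", "Joe")}),
                   parent=base.instance),
    parent=base,
)
```

Because layers are immutable, old commits remain readable forever; a branch is just a pointer to a commit, as in git.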

In June 2020, we released version 2.0 - this included the revision control API: push, pull, branch, and merge, fully integrated with the database, query, and modeling engines. At this stage the database itself was more or less feature-complete, but there was one more critical piece before we could say we had delivered on our vision.
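A merge over triple-shaped data has a pleasantly simple conflict-free core. The sketch below shows a three-way merge over plain triple sets under the simplifying assumption that the two sides don't contradict each other - it illustrates the general idea, not TerminusDB's actual merge algorithm:

```python
# Minimal sketch of a three-way merge over triple sets: apply both
# sides' additions and deletions relative to the common base.

def three_way_merge(base: set, ours: set, theirs: set) -> set:
    added = (ours - base) | (theirs - base)
    removed = (base - ours) | (base - theirs)
    return (base - removed) | added

base = {("joe", "age", "30"), ("jane", "age", "25")}
ours = base | {("joe", "city", "Dublin")}       # our branch adds a triple
theirs = base - {("jane", "age", "25")}         # their branch deletes one

merged = three_way_merge(base, ours, theirs)
# merged keeps joe's age, gains joe's city, and drops jane's age
```

Real merges must also keep schema and instance changes consistent with each other, which is where the extra machinery comes in.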

With distributed collaboration technology there is always a bootstrapping problem - it’s no use having technology that allows you to collaborate on data in a peer-to-peer decentralized way unless there are other people out there to connect to and collaborate with. To overcome that problem, we needed to deliver an infrastructure that would allow people to get started, to share and find data and collaborators.

Today we release TerminusDB version 3.0 and at the same time we open the doors of TerminusHub. The database is now fully integrated with the hub, allowing all TerminusDB users to share, store, publish and collaborate on databases with other users and do so at the grand price of free. With this release, I think we can say that we have a product that delivers on our vision.

The product itself is only a means to an end. We built TerminusDB to take away the pain of building amazing evidence bases for computational analysis. Although we will relentlessly continue to focus on product and remove every pain point that shows up, we now have the type of tool that we wanted. Now we are going to build some truly wonderful data resources.


Cool! I tried to find some information on what versioning is like. I'm not very familiar with OWL and RDF, but I'm trying to get a sense whether the diff between two ontologies is like the set difference between the set of RDF triples, or if there is additional structure on top of that, so that the diffs are more "semantic".


Not OP but good question. The diffs end up being semantic and not merely a difference of triples. This is due at least in part to the open world nature of RDF and concepts such as reasoning and materialization.


In our world the diffs actually are simply triples. The consequences of, say, changing a class can be complex, but the diffs themselves are just "these triples minus, these triples plus", which really helps simplify things.
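That "minus, plus" shape is just two set differences. A sketch in Python, with triples as plain tuples (an illustration of the idea, not TerminusDB's diff format):

```python
# A triple diff as two set differences: what the old graph has that
# the new one lacks, and vice versa.

def diff(old: set, new: set) -> tuple:
    minus = old - new   # triples removed
    plus = new - old    # triples added
    return minus, plus

old = {("joe", "rdf:type", "Employee"), ("joe", "name", "Joe")}
new = {("joe", "rdf:type", "Manager"), ("joe", "name", "Joe")}

minus, plus = diff(old, new)
# minus: {("joe", "rdf:type", "Employee")}
# plus:  {("joe", "rdf:type", "Manager")}
```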

We also support a bi-modal interpretation of ontologies: if you put them in a graph of type "schema" they are interpreted in a closed-world way; in a graph of type "inference" they are interpreted in an open-world way.

In our experience the inference stuff is super cool (for example, we use a property chain rule in our system database to allow arbitrary nesting of authority domains), but 99% of the effort is making sure that your data and schema line up, and for that you really want to be operating in a closed-world regime.


Thanks. It would be great to get some more detail on this.

Quick note, it seemed like it supports only closed-world ontologies.


That's a generous free tier - unlimited databases!?

I've been enjoying the blog posts about graph databases and technology choices.

Looking forward to checking this out. Good luck with the launch!


We use succinct data structures to represent our graphs. Succinct data structures are self-indexing data structures that approach the information-theoretic minimum size of representation. Additionally, all the compute is on the local machine - we pass around highly compressed memory cores and do all of the orchestration (commit graph, permissions, etc.) in the Hub. So far, the cost is very low, so we are able to be generous in the free tier.
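The "self-indexing" idea is that you can answer queries over the compressed form without decompressing it. A toy sketch of the classic building block, a bit vector with rank support (block size and layout are simplified here; real succinct structures are far more refined than this):

```python
# Toy rank-support bit vector: precomputed per-block 1-bit counts let
# rank queries run over the stored bits directly, with only small
# auxiliary storage on top of the data itself.

BLOCK = 8

class RankBitVector:
    def __init__(self, bits):
        self.bits = bits
        self.block_ranks = []   # ones seen before each block boundary
        count = 0
        for i, b in enumerate(bits):
            if i % BLOCK == 0:
                self.block_ranks.append(count)
            count += b

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        block = i // BLOCK
        return self.block_ranks[block] + sum(self.bits[block * BLOCK:i])

bv = RankBitVector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1])
print(bv.rank1(10))  # 1-bits among the first ten positions -> 6
```

Rank and its companion operation, select, are what let a compressed adjacency representation answer "what are this node's neighbours?" without a separate index.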

For example, we have DBpedia available to clone on Hub now, we get about 10x compression on that so can serve it up really efficiently (clone to query in about 2 minutes with a normal enough connection).

The hope is that we'll be able to keep it that way forever to facilitate public-interest and open source work.


It looks like this FAQ answer is meant to point to a blog post, but I don't see a link:

https://terminusdb.com/docs/frequently-asked-questions#why-n...


Thanks - will put in the link.


Are you using a self-signed cert?

I'm getting a "Warning: Potential Security Risk Ahead" from Firefox.


Yes. Because the desktop client runs on localhost (or 127.0.0.1, actually), there is no way to produce a self-signed cert that won't cause browsers to complain. The desktop Electron app makes this problem disappear from view, but otherwise it's more or less unavoidable if you want to let people connect over https to a service running on loopback. In our case that's really what we want, as we're a bit paranoid about exposing users to internet-borne risk!


What? Why can't you just use a reverse proxy that terminates ssl?

I don't think you're going to get any traffic if the landing page people see is a security warning.


I don't think a reverse proxy would work in this case due to the JWT security requirements (callback endpoints that must be internet accessible). In any case, that's one of the big reasons we are favoring the Electron app as the primary desktop interface: it takes the problem away. The browser-based interface is targeted at server deploys, and in those cases, if you're hosting from a proper IP address, you want to install your own cert of course.


I don't know what to tell you, I personally run many services this way, and I've yet to encounter a situation where this doesn't work.

Regardless, your landing page currently is inaccessible. That should be a major concern.


It is a major concern - one of the things driving our focus on getting the Electron desktop app out to fix the problem. It shipped yesterday and makes that terrible warning disappear.

Allowing your users to authenticate directly from localhost to whatever internet authentication providers they want is one of the few situations where reverse proxying can't work - it's effectively a man-in-the-middle attack. If you could set up a reverse proxy that let you, for example, sign in to your Google account or another OAuth provider from port 80 while it did all the https for you, the internet would be in big trouble.

That's one of the big reasons driving the popularity of Electron, IMO: it hides the browser warnings that come with running https on localhost (which you really want to do for security reasons anyway), and it was one of our main motivations for choosing Electron as our primary desktop package.


Are you talking about the website, which gives a warning due to some http graphics (which we need to fix), or the cert issue when connecting to the console of TerminusDB/Hub? The first we can fix quickly (and is just on the community page I think). The other is a bigger issue with certification and the reason we moved to electron apps as the standard delivery.



