Uh, I don't know what world you live in but I'd like the address because mine sucks in comparison.
Text logs are definitely not a "universal format". Easily accessible, sure. Human readable most of the time? Okay. Universal? Ten times nope.
Let me give you an example: uwsgi logs don't even have timestamps, and contain whatever crap the program's stdout outputs, so you often end up with three different types of your "universal format" in there. I'm not giving this example because it's contrived, but because I was dealing with it the very moment I read your comment.
But at least you have a fighting chance. What if that exact same data was dumped into a binary file, that you did not know how to decode?
Originally, you had a problem - the data wasn't formatted in a manner that you could parse cleanly.
Now, you have a new problem - not only is the data not formatted properly, it's now in some opaque binary file.
Saying that there are poorly formatted text files isn't a hit against text files, it's a hit against poor formatting. The exact same problem exists if the file is in binary form, and not formatted properly.
> a binary file, that you did not know how to decode
I guess nobody ever advocated putting stuff in a binary file with an undefined format. Databases, syslog-ng, elasticsearch and the systemd journal all have a defined format with plenty of tools to access the data in a more structured way (eg. treating dates as dates and matching on ranges).
I agree the issue at hand is not just binary vs. plain text, it's more "how much you want to structure your data".
The classic syslog format is very loosely defined, with every application defining its own dialect, each with its own way to separate fields and handle escaping. To fix that you could store the log data as JSON, as many online services are doing. But once you have JSON, grep is no longer enough to properly handle the data, even though it's still plain text. Now that you have both a rather verbose format on disk and the need for custom tools, why not store the log as binary-encoded JSON (eg. something like JSONB in PostgreSQL)? Or make it even more efficient with a format optimized for the specific usage? Add some indexes and you get more or less what databases, ElasticSearch and the journal do.
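To make the escaping point concrete, here's a minimal sketch of structured JSON logging (field names like `ts`/`level`/`msg` are made up for illustration, not any standard). The encoder handles quoting for you, and grep can still find "error" in the line, but only a parser can tell the level field apart from message text:

```python
import json
from datetime import datetime, timezone

# A hypothetical log record; field names are assumptions for this sketch.
record = {
    "ts": datetime(2015, 12, 13, 1, 34, 55, tzinfo=timezone.utc).isoformat(),
    "level": "error",
    "msg": 'disk "sda" failed: I/O error',  # quotes need no home-grown escaping rules
}

# Compact form: one record per line, escaping handled by the encoder.
line = json.dumps(record, separators=(",", ":"))
print(line)

# grep would match the word "error" anywhere in the line; structure lets
# you distinguish the level field from the message body.
parsed = json.loads(line)
print(parsed["level"])
```

The same record in JSONB or a journal-style format is just this structure with a more compact on-disk encoding plus indexes.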
Also keep in mind that most logs right now get rotated and compressed with gzip; I doubt that the above binary formats are less resilient to errors than a gzip stream.
That's what the grandparent was explaining though. We have near-ubiquitous tools for dealing with plaintext files. Every Linux admin knows them and uses them in many more situations than just log files. They can be scripted and piped, and an admin worth his salt could easily find the info he needs with them.
A binary file from whatever logging system, OTOH, is effectively proprietary. Even if the logging system provides you with tools to work on them, you have to 1) know that it's a log file for that logging system, and 2) be familiar enough with the tools in order to work with it.
And the specs will be gone in 40 years. While ASCII will stick around.
Why would they be gone? You realize ASCII is a 'spec' too?
If a binary format has an open specification, it's as future proof as ASCII. ASCII's durability is due to a clear and open specification that's easily implemented. Not some magic sauce that makes it instantly human readable.
That text you see? It's not what's actually in the file. That's just 1's and 0's like every other format. There's literally no difference between ASCII and any other "binary" format.
Does that really matter? Log files are often unimportant when they get over a month or two old, what is it in your log files that has to be kept for 40 years?
Longevity of log files hardly seems like a reason to pick an otherwise inferior format.
It is not about reading 40-year-old logs, but rather reading logs produced today by a 40-year-old system.
For example, many nuclear power plants in the West were built 40 years ago. Amongst the myriad of sensors and devices in a power plant, I think most of them output ASCII logs. Those are still readable today. (The same can be said about avionics, space probes, etc.)
Now imagine yourself 40 years from now, trying to fix or reverse engineer a very legacy system: you will have to recompile a journalctl from 40 years ago before being able to read anything.
There's a good chance that you'd be reading EBCDIC logs. :)
40 years from now, you will probably be able to invoke journalctl on the system and parse the dumped output as plain text. Or call gunzip on the compressed logs, $DEITY knows if we will be still using gzip by then. And if the system does not boot, you won't be able to connect the peripherals anywhere else... :)
There's no tool out there that generates log files it can't itself read. So there's not going to be any "oh gee, I have these files being generated and nothing can read them" situation.
However, there are close to zero systems out there that generate text logs they can themselves read back. Text logs are write-only for most logging systems, while all binary logs I know of are read+write.
Stepping back, though, this entire argument is absurd. Worrying about "whatever will those people do 40 years from now with the tools of today" is fairly braindead once you understand that the quality of the tools will affect their longevity. So if the logging system becomes an actual, factual problem over time, the tools will die off by naturally-artificial selection.
I have already worked on very basic embedded systems where your only way of getting logs is connecting to the device over a serial line; after fiddling a bit with the baud rate, you can get some readable output.
In this case, you can't really do anything from the device itself.
Arguably, this is not the use case for a binary logger, but I was originally addressing the "40-year-old logs" argument, which does exist in the real world.
> There's no tool out there that generates log files it can't itself read.
There are plenty of tools that don't read their logs - more precisely, computer units where you don't log in, units that you don't operate from a console. Embedded devices that perform some function and also keep a log, but which cannot be used for reading that log. You will need to read that log using something else. Plain text (ASCII, and now ISO Latin and UTF-8) is a fairly stable format for everything, and will be for the next 50 years.
People usually read log files because something went wrong, like a system crash, why do you assume the OS that generated the log file will be readily available?
I've been producing a few services recently which output a chunk of JSON for each log message followed by a newline.
I think it actually solves most of the problems text logs have that binary don't (inability to easily present structured data, etc.) yet keeps the advantages of a text log (human readable, resistant to file corruption, future-proof).
Speaking for myself - multi-line .json output is problematic, as most parsing tools work best when the data is on a single line, and it's a cognitive struggle to deal with multi-line output, even if you are clever with your tools. I usually end up writing a json parser in python to get the data into a format I can manipulate. (Thankfully, python does 95% of the work for you when reading a json file.)
But - here is the thing, even though the .json format isn't convenient for me, I can, with about 20-30 minutes effort, write a parser that can get the data into a convenient format, because it started out as a text file.
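For what it's worth, that 20-30 minute parser can be fairly short. A sketch using only the stdlib, assuming the file is nothing but concatenated (possibly pretty-printed) JSON objects, which re-emits each one as a single compact line:

```python
import json

def records(text):
    """Yield each JSON object from a blob of concatenated,
    possibly pretty-printed JSON documents."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # Skip whitespace between documents.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end

# A small pretty-printed example blob (contents made up).
blob = (
    '{\n  "level": "info",\n  "msg": "started"\n}\n'
    '{\n  "level": "error",\n  "msg": "boom"\n}\n'
)

# Flatten each record back onto one line so normal line tools work again.
flat = [json.dumps(r, separators=(",", ":")) for r in records(blob)]
for line in flat:
    print(line)
```

`raw_decode` returns where each document ends, which is what makes stitching multi-line records back together cheap.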
If you're just grepping for a single word or phrase it really isn't much different to grepping regular logs.
If you're extracting structured data (e.g. getting the time stamp and a status code), it's actually easier than screwing around with awk and figuring out which exact column the time stamp finishes on and hoping that server #7 doesn't put it on a different column.
Well - to be clear, if I run into a log file with its data on a single line, 95% of the time it will take < 30 seconds to extract the data I need. If I run into a multi-line json file, trying to re-integrate all the data back into a single record will take me on the order of 30 minutes. (Mostly because I only do it once or twice a year, so I typically start from first principles each time. Multi-line .json log files are very rare.)
95% of the time I just give up on the multi-line .json files - unless it's really, really critical, I probably don't want to spend 30 minutes writing code to re-assemble the data.
Text log files, wherever possible, should capture their data on a single line. If they need to go multi-line, then having a transaction ID that is common among those lines makes life easier.
.json files (or xml files), are an interesting halfway point between pure text, and pure binary. They aren't easily parseable without tools, but, if you have to, you can always write your own tools to parse them.
Maybe I am misunderstanding, but it sounds like you are encountering bad json log file practices, because json entries are spanning multiple lines. Which implies they are being printed in non-compact form, aka prettified. That's a problem in the pure text world too, and hurts worse when it happens there. It's kind of an apples to oranges comparison.
JSON log files should ideally print using compact form (which will never contain raw newlines) so each entry only takes one line, with entries separated by a raw \n.
If that practice is followed each line will represent the complete json object. So you can then pipe the file through jq, Perl, python etc one line at a time.
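The one-object-per-line pattern really is that simple to consume. A minimal sketch (entries and field names like `level`/`msg` are invented for the example):

```python
import io
import json

# Two compact, one-object-per-line entries, as a stand-in for a log file.
log = io.StringIO(
    '{"ts":"2015-12-13T01:34:55Z","level":"error","msg":"log message C"}\n'
    '{"ts":"2015-12-13T12:34:55Z","level":"debug","msg":"log message B"}\n'
)

# One json.loads per line -- the same shape as piping through jq/perl/python.
entries = [json.loads(line) for line in log if line.strip()]
errors = [e for e in entries if e["level"] == "error"]
print(len(errors), errors[0]["msg"])
```

The equivalent jq filter would be a one-liner too, since jq also reads one document per line by default.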
Printing prettified json to a log should be avoided, because it then requires reconstituting individual events syntactically before you can even grep for an event. If pretty output is desired, pipe it through a prettifier.
Config files are a different story; those should most definitely be pretty printed, with one atom per line, for nice diffability and the best readability and editability json can offer. Sadly, json for config files is a bad idea if you want humans to enjoy editing them by hand. In that case YAML is the best option I have encountered (ansible).
I have no problem with json output in log files, but I would greatly prefer it be constrained to the message portion of a logline. At a minimum I generally want three things per line, a timestamp (in ISO 8601 or something close), a message type (info, warning, error, etc) or log entry source, and the message itself. I don't want to be looking into the JSON structure itself for a timestamp, especially when the field encoding the timestamp may be called something slightly different based on what generated the log...
In that respect, whether the message is JSON, or YAML, or XML doesn't matter, that can easily be worked on later, but the first thing I want to be able to do is filter by time and type.
>I don't want to be looking into the JSON structure itself for a timestamp
A) JSON parsers are relatively common and reliable.
B) The timestamp would be human readable even without the parser.
>especially when the field encoding the timestamp may be called something slightly different based on what generated the log...
I often come across logs that put timestamps in different places on the line and encode them differently (or don't output a timestamp at all, sometimes). This is no different to having to deal with a differently named JSON property.
My point is really around having the date be in a well defined place that isn't necessarily defined by the application that's logging. If the log entry date is at the beginning of the line, there's no ambiguity as to whether it's the log entry date or some other date being logged, and it also doesn't require parsing the JSON at all to filter by the date. If it's not at some very standard location that's easy to filter by (a possibly changing JSON property does not qualify), then it's hard to know you are filtering on the right data, and it may also require a transform before filtering. JSON parsers are fast. Multi-GB log files will still cause some extra overhead and slow the operation down, so it's best to reduce the working set before parsing the JSON.
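The "reduce the working set first" idea can be sketched like this (a toy example; the ISO timestamp prefix, level field, and JSON payload layout are assumptions, not any particular logger's output):

```python
import io
import json

# Hypothetical entries: timestamp and level live outside the JSON message.
log = io.StringIO(
    '2015-12-13T01:34:55Z info {"msg":"A"}\n'
    '2015-12-14T09:00:00Z error {"msg":"B"}\n'
    '2015-12-14T10:30:00Z error {"msg":"C"}\n'
)

day = "2015-12-14"
hits = []
for line in log:
    # Cheap string prefix check first: lines outside the time window
    # never reach the JSON parser at all.
    if not line.startswith(day):
        continue
    ts, level, payload = line.rstrip("\n").split(" ", 2)
    hits.append((ts, level, json.loads(payload)))

print(len(hits))
```

On a multi-GB file the prefix check plays the role of a coarse grep, and only the surviving lines pay the parsing cost.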
>My point is really around having the date be in a well defined place that isn't necessarily defined by the application that's logging. If the log entry date is at the beginning of the line, there's no ambiguity as to whether it's the log entry date or some other date being logged, and it also doesn't require parsing the JSON at all to filter by the date.
Take this example:
1-1-15 1:1:1 Info Log message A
12-13-15 12:34:55 Debug Log message B
12-13-15 1:34:55 Error log message C
12-13-15 1:34:55Error log message D
[12-13-15 1:34:55]Error log message E
It doesn't require parsing JSON to get the date, you're right about that. It's harder than parsing JSON, though.
Note that two replies of mine prior, I stated ISO 8601 or similar. Also note where I said the json would be constrained to the message portion of the entry. Preferably there's a logging mechanism that takes care of that for you, so you can't screw up the timestamp and type portions of the entry. In that case, your entries become:
2015-01-01 01:01:01 Info Log message A
2015-12-13 12:34:55 Debug Log message B
2015-12-13 01:34:55 Error log message C
2015-12-13 13:34:55 Error log message D # let's assume that was 1 PM data for the sake of the example
2015-12-13 01:34:55 Error log message E
Getting the date is trivial. Getting the type is also trivial. Give the type a static field size and it's even more so. The point is, you abstract the message from the rest of it, so the message can't screw up the metadata of the entry, and you can log whatever you want for the actual message (xml, json, plain text, whatever, just no raw newlines). This is what we have today with syslog, sans newline replacement and a slightly different date format (but still unambiguous). It works. It's useful. It's VERY easy to filter by type or date. You can take the first X chars and split on space/whitespace if you need to. You can log a message of a few megabytes, and if there are no raw newlines there are efficient utilities to ignore that until you have what you want (/bin/cut).
Not to disagree with you in any way, but `jq` is something you might look to add to your toolbox. As much JSON as we see these days, it's a good tool to have.
Like you, I also deal with the kinda weird uwsgi logs. I feel like "universal format" probably didn't mean the format of all the lines in all the logs is the same - though your definition is probably more accurate.
Despite that, I can be pretty sure when I walk into a foreign system there will be nginx logs, just where I expect them, almost certainly in the format I'm used to. And even if the format differs, it's not much of a problem. Binary logs, big problem.
Sure, on a site that uses ElasticSearch for its logs I would have no idea where to look. I'd be more at ease with SQL, but first you need to locate the DB, figure out the schema, and get the SQL dialect right.
That said, I'd be far more at ease writing a SQL query to extract analytics from logs than cooking up some regexes and doing complex stuff with awk.
And I find the --since/--until parameters to journalctl far easier than matching dates by regex. Or even the --boot parameter to restrict logs to a specific boot, which would probably be doable with awk but definitely not as trivial.
I think that binary logs give you some compelling features, without taking away any: you can always just dump the logs on stdout and use grep as much as you want. :)
"Text logs" is not a format at all, so it can't really be a universal format, either. But if there were such a thing as a "universal" format, it would probably by definition encompass everything in time and space. You think timestamps are a problem? Just wait until your logs get trapped in a quantum state. Talk about a heisenbug...