ianso's comments | Hacker News

I think there could be a GH feature request for something like this (opt-in though, not opt-out).

In my personal GH account there is a "sponsor" button that shows me what dependencies I have that I could sponsor. Unfortunately the list is empty.

My _organisations_ have hundreds of repos, but there's no "sponsor" option at the org level in GH that shows what dependencies the orgs use and lets you set up batch sponsorships at that level.

The dependency data already exists in dependabot for a lot of stuff, so it wouldn't be starting from scratch.


I think this stuff is super important, simply because there is a ton of stuff we can't do using our phones today.

Think mesh networking, resilient ad-hoc application clustering, non-Internet P2P, like Freifunk but everywhere. We shouldn't have to depend on Google or any of the big tech companies for anything except the hardware.

That would offer much more freedom. There are also contexts where this kind of thing could enable life-saving applications. And unlike today's Internet, where a database query in Cloudflare or a DNS bug in us-east-1 can disrupt half the services we use, this kind of technology really could withstand major attacks on infrastructure hubs, like the Internet was originally designed to do.


Twenty years ago, if you told me that by today we'd have smart phones with eight or more cores, each outperforming an average desktop computer of the time, with capacitive OLED touch screens, on a cellular network with hundreds of megabits of bandwidth, I'd find it believable, because that's where technology was headed at the time.

If you said that they'd effectively all be running either a port of OS X or a Linux distribution with a non-GNU but open source userspace, I'd consider that a somewhat unexpected success of open-source software. I would not at all expect that it would be as locked down as a video game console.

The more time passes, the less I use my phone for, and the more likely I am to whip out my laptop to accomplish something, like it's 2005.


The open source components in your Android phone are suffering from what the FSF called "tivoization" nearly two decades ago. They can't reasonably be replaced without breaking security measures, which is a pretty high barrier for most users, and sometimes even for advanced users. That removes the biggest benefits of being open source.


Open source userspace? Google Play Services?


>Think mesh networking, resilient ad-hoc application clustering, non-Internet P2P, like Freifunk but everywhere.

(if dumbed down) What are the gaps in features and functionality between what you're describing and what might be achievable today (given enough software glue) with an SDR transceiver and something like Reticulum [1] on an Android?


Very good question!

SDR + something like Reticulum or Yggdrasil would definitely provide the infra or network fabric for the kind of thing I'm thinking of.

However, a normal Android phone, e.g. a Pixel 7, can't, to my knowledge, be turned into a web server or a podman host for containers. (I know of people hosting websites on old Androids that have been flashed or hacked.)

Given phones already have a WiFi/WLAN radio chip, it's a shame to need extra kit for connectivity.

It's something that's been on my mind a lot recently and so you provoked me into writing down a series of scenarios in story format that illustrates what SHOULD be possible using current hardware, were it not, as dlcarrier says, locked down like a games console.

Here you go:

https://ianso.blogspot.com/2025/11/what-we-dont-have.html


The dumbest part of this is that all Wikimedia projects already export a dump for bulk downloading: https://dumps.wikimedia.org/

So it's not like you need to crawl the sites to get content for training your models...


I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

It's not clear which files you need, and the site itself is (or at least was, when I tried) "shipped" as gigantic SQL scripts that rebuild the database, with so many lines that the SQL servers I tried gave up reading them, requiring another script to split them up into chunks.

Then when you finally do have the database, you don't have a local copy of Wikipedia. You're missing several more files, for example category information is in a separate dump. Also you need wiki software to use the dump and host the site. After a weekend of fucking around with SQL, this is the point where I gave up and just curled the 200 or so pages I was interested in.

I'm pretty sure they want you to "just" download the database dump and go to town, but it's such a pain in the ass that I can see why someone else would just crawl it.


> I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

More recently they started putting the data up on Kaggle in a format which is supposed to be easier to ingest.

https://enterprise.wikimedia.com/blog/kaggle-dataset/


"More recently" is very recently; there hasn't been enough time yet for data collectors to evaluate changing their processes.


Good timing to learn about this, given that it's Friday. Thanks! I'll check it out


I think there are engineers working for crawler companies who are paid well enough to figure out how to do this without kneecapping the most well-known noncommercial projects still surviving on the capitalized internet.


Have you tried any of the ZIM file exports?

https://dumps.wikimedia.org/kiwix/zim/wikipedia/


Yeah, it's a bit confusing at first to navigate. Luckily, they offer XML dumps that aren't too bad to work with:

1. Go to https://dumps.wikimedia.org/enwiki/latest/ (or a date of your choice in /enwiki)

2. Download https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page... and https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.... The first file is a bz2-multistream-compressed XML dump containing all of English Wikipedia's text, while the second file is an index to make it easier to find specific articles.

3. You can either:

  a. unpack the first file
  b. use the second file to locate specific articles within the first file; it maps page title -> file offset for the relevant bz2 stream
  c. use a streaming decoder to process the entire Wiki without ever decompressing it wholly
4. Once you have the XML, getting at the actual text isn't too difficult; you should use a streaming XML decoder to avoid as much allocation as possible when processing this much data.
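To make option 3b concrete, here's a minimal Python sketch (the file names are placeholders for whatever you downloaded, and it assumes you've decompressed the index file first) that uses the index to pull out just the bz2 stream containing one article:

    import bz2

    def find_stream_range(index_path, title):
        # Each index line is "offset:page_id:page_title", where offset is the byte
        # position of the bz2 stream (of ~100 pages) that contains the page.
        target, offsets = None, set()
        with open(index_path, encoding="utf-8") as f:
            for line in f:
                offset, _, page_title = line.rstrip("\n").split(":", 2)
                offsets.add(int(offset))
                if page_title == title:
                    target = int(offset)
        if target is None:
            raise KeyError(title)
        following = sorted(o for o in offsets if o > target)
        return target, (following[0] if following else None)

    def extract_stream(dump_path, start, end):
        # Read one complete bz2 stream out of the multistream dump and decompress
        # it; the result is an XML fragment of roughly 100 <page> elements.
        with open(dump_path, "rb") as f:
            f.seek(start)
            data = f.read(end - start if end is not None else -1)
        return bz2.decompress(data).decode("utf-8")

    start, end = find_stream_range("index.txt", "Computer accessibility")
    print(extract_stream("dump-multistream.xml.bz2", start, end)[:500])

Scanning the whole index like this is the simple-but-slow route; for repeated lookups you'd load it into a dict or a small database once.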

The XML contains pages like this:

    <page>
      <title>AccessibleComputing</title>
      <ns>0</ns>
      <id>10</id>
      <redirect title="Computer accessibility" />
      <revision>
        <id>1219062925</id>
        <parentid>1219062840</parentid>
        <timestamp>2024-04-15T14:38:04Z</timestamp>
        <contributor>
          <username>Asparagusus</username>
          <id>43603280</id>
        </contributor>
        <comment>Restored revision 1002250816 by [[Special:Contributions/Elli|Elli]] ([[User talk:Elli|talk]]): Unexplained redirect breaking</comment>
        <origin>1219062925</origin>
        <model>wikitext</model>
        <format>text/x-wiki</format>
        <text bytes="111" sha1="kmysdltgexdwkv2xsml3j44jb56dxvn" xml:space="preserve">#REDIRECT [[Computer accessibility]]

    {{rcat shell|
    {{R from move}}
    {{R from CamelCase}}
    {{R unprintworthy}}
    }}</text>
        <sha1>kmysdltgexdwkv2xsml3j44jb56dxvn</sha1>
      </revision>
    </page>
so all you need to do is get at the `text`.
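For step 4, here's a rough sketch of the streaming approach in Python (the dump path is again a placeholder); it never holds more than one page in memory:

    import bz2
    import xml.etree.ElementTree as ET

    def iter_pages(dump_path):
        # bz2.open streams the decompression, iterparse streams the XML, and
        # clearing the root after each page keeps memory usage flat.
        with bz2.open(dump_path, "rb") as f:
            context = ET.iterparse(f, events=("start", "end"))
            _, root = next(context)                        # the <mediawiki> root
            for event, elem in context:
                if event == "end" and elem.tag.rsplit("}", 1)[-1] == "page":
                    title, text = None, ""
                    for child in elem.iter():
                        tag = child.tag.rsplit("}", 1)[-1]  # strip the namespace
                        if tag == "title":
                            title = child.text
                        elif tag == "text":
                            text = child.text or ""
                    yield title, text
                    root.clear()                           # drop processed pages

    # e.g. print the first non-redirect title
    for title, text in iter_pages("dump-multistream.xml.bz2"):
        if not text.startswith("#REDIRECT"):
            print(title)
            break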


The bigger problem is that this is wikitext markup. It would be helpful if they also provided HTML and/or plain text.

I know there are now a couple of pretty good wikitext parsers, but for years it was a bigger problem. The only "official" one was the huge PHP app itself.
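To give a flavour of the problem: even a crude, deliberately naive pass at turning wikitext into plain text needs several stages, and it still breaks on nested templates and tables (this is a sketch to show the pain, not a substitute for a real parser):

    import re

    def crude_plaintext(wikitext):
        # Deliberately naive: nested {{templates}}, tables, and parser functions
        # will all defeat these regexes, which is exactly the problem.
        text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # innermost templates only
        text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
        text = re.sub(r"'{2,}", "", text)                              # ''italic'' / '''bold'''
        text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # inline <ref> footnotes
        return text

    print(crude_plaintext("'''Foo''' is a [[metasyntactic variable|placeholder]].{{citation needed}}"))
    # prints: Foo is a placeholder.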


Oh, it's godawful; the format is a crime against all things structured. I use `parse-wiki-text-2` [0], which is a fork of `parse-wiki-text`, a Rust library by an author who has now disappeared into the wind. (Every day that I parse Wikipedia, I thank him for his contributions, wherever he may be.)

I wrote another Rust library [1] that wraps around `parse-wiki-text-2` that offers a simplified AST that takes care of matching tags for you. It's designed to be bound to WASM [2], which is how I'm pretty reliably parsing Wikitext for my web application. (The existing JS libraries aren't fantastic, if I'm being honest.)

[0]: https://github.com/soerenmeier/parse-wiki-text-2

[1]: https://github.com/philpax/wikitext_simplified

[2]: https://github.com/genresinspace/genresinspace.github.io/blo...


What they need to do is have 'major edits' push out an updated static-rendered file, like old-school processes would. Then either host those somewhere as-is, or also in a compressed format. (E.g. a compressed weekly snapshot retained for a year?)

Also make a CNAME from bots.wikipedia.org to that site.


This probably is about on-demand search, not about gathering training data.

Crawling is more general + you get to consume it in its reconstituted form instead of deriving it yourself.

Hooking up a data dump for special-cased websites is much more complicated than letting LLM bots do a generalized on-demand web search.

Just think of how that logic would work. LLM wants to do a web search to answer your question. Some Wikimedia site is the top candidate. Instead of just going to the site, it uses this special code path that knows how to use https://{site}/{path} to figure out where {path} is in {site}'s data dump.


Yeah. Much easier to tragedy-of-the-commons the hell out of what is arguably one of the only consistently great achievements on the web...


> This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.

Sounds like the problem is not the crawling itself but downloading multimedia files.

The article also explains that these requests are much more likely to request resources that aren't cached, so they generate more expensive traffic.


I need to work with the dump to extract geographic information. Most mirrors are not functioning, take weeks to catch up, block you, or only mirror English Wikipedia. Every other month I find a workaround. It's not easy to work with the full dumps, but I guess/hope it's easier than crawling the Wikipedia website itself.


Why use a screwdriver when you have a sledgehammer and everything is a nail?


nAIl™ - the network AI library. For sledgehammering all your screwdriver needs.


I thought that as well, but maybe this is more for search-engine indexing? In which case you'd want more real-time updates?


I don't see an obvious option to download all images from Wikimedia Commons. As the post clearly indicates, the text is not the issue here, it's the images.


It seems like the Wikimedia Foundation has always been protective of image downloads, going back to the early days. So many drunken midnight scripters or new urban undergrad CEOs discover that they can download cool images fairly quickly. AFAIK there has always been some kind of text corpus available in bulk because it is part of the mission of Wikipedia. But the image gallery is big on disk and big on bandwidth compared to text, and a low-hanging target for the uninformed, the greedy, etc.


The Wikimedia nonfree image limitations have been a pain in my ass for years.

For those unfamiliar: images that are marked NonFree must be smaller than 1 megapixel, i.e. about 1155 × 866. In practice, 1024×768 is around the maximum size.


This is what torrents are built for.

A torrent of all images updated once a year would probably do quite well.


Provided you have enough seed nodes, which isn't free.


I have excess bandwidth in various places - would be happy to seed.


No offence to Mr. Anderson, but it reads a bit like someone has done an 'Inception' on him: subtly planted the seed of a train of thought that would lead him to disband his efforts, all the while believing it was his own idea.


Does anyone else think this should be made into a movie?

Massive respect to this ED.


"ER slammed with patients, doctors improvise solutions and everyone is saved" is a common episode plot for TV medical dramas.


Popping up here to mention OpenMRS, a healthy open-source EMR used by hundreds if not thousands of facilities across the world, mostly in Africa and Asia. An old version is packaged/integrated with some other apps into Bahmni, which is a full-blown hospital management system.

Honestly, people complain about software all the time, and all software sucks to some extent, some more than others of course. Complaining about all EMRs just because the USA bodged a national rollout in its uniquely messed-up healthcare situation is a bit myopic.


I would not dismiss his complaints as broad-brush IT disparagement. Despite the title, I think the author is referring to the EHR (Electronic Health Record), specifically Epic's, which is unique in that it is essentially a "benevolent tech dictatorship" imposed upon 90% of hospitals through owner/lead programmer Judith Faulkner at Epic (no medical background); hospitals were lured in initially by federal incentives and are now married to it in perpetuity. That's not really anything like operating system software you can uninstall today (e.g. Microsoft, Linux, Apple). With the onerous hospital-system transformation contracts and non-disclosure agreements, Epic's exec customers are more like cult followers than standard software consumers.


> Judith Faulkner at Epic (no medical background)

I suppose that's technically true, although Wikipedia states that she cofounded Epic with a medical doctor and is married to a medical doctor. As a software developer, I rely on subject matter experts frequently. I'm not an expert in radiology, chemical engineering, or any of the other businesses that I support.

https://en.wikipedia.org/wiki/Judith_Faulkner


Very true, people complain about software, but my recent experience in a Gulf country was stellar: all my medical records were entered properly, including procedures; it takes approximately 12 minutes to get medicine from the pharmacy; and I can view the records online. The patient interface could be improved a bit but did its job perfectly. The article was also stellar, as it summarized the state of the art and the issues well.


I believe this article is only about this specific system. The problem is that the name of the system is "Electronic Medical Record", which makes the article sound like all such systems are hated.


I don't think that's the case. AFAIK there's no such system. The article specifically references Epic, which is one of the biggest (_the_ biggest?) EHR software companies in the US. My wife used it as a nurse and I've watched my own providers struggle with it. It sucks.


Epic is also used outside of the US; there are famous botched implementations of it in Norway and Finland. It's not just a USA problem.


Indeed. The biggest hospital where I live (Trondheim, Norway) is still running at reduced capacity and over budget a year after moving to an Epic derivative. Doctors and nurses are burning out, many talking about moving elsewhere. Being recently retired, I am concerned that this may not be a safe place to grow old, if the hospital (by all accounts an excellent one) is dragged down by this horror of a system.


Saying "it sucks" is like saying "SAP sucks". Whatever system you get out of one of these projects is 99% dependent on the project team (on both sides), and 1% on the product base, which is invariably "meh".


If this is the case, the name of the system needs proper capitalization in the title.


All such systems are hated by most users. Source: I work in the industry.


Funny that the article focuses on huge corps for on-prem. I work for a non-profit and we deploy stuff to hardware in Afghanistan, Haiti, Sudan... You can't assume the Internet works and if it does, it could be a 1Mbps VSAT connection.

Needless to say we don't have a Fortune 500 budget. When I see a nice, affordable app that says "and here's a Docker image!" next to the SaaS options, I am like ( ˘ ³ ˘ ) ♥


Love this! See also ssllabs.com and sshcheck.com.


This reminds me of "two stories of the pistol", some anecdotes where a sentry actually does almost kill some idiot senior officers:

http://everything2.com/title/Two+stories+of+the+pistol


My father's ship was patrolling off the coast of South Vietnam. They were told that civilian Vietnamese fishing boats traveling east or west were fishing, whereas those going north or south were running ammunition or troops.

So one day they spotted a sailboat going south. They requested permission to shell it. Permission was granted.

Dad thought it was quite funny, the shower of fish in the air when the shell detonated.

It never occurred to me, until many years later, to wonder about the crew aboard the fishing boat. It was quite common for entire families to live on their fishing boats for their whole lives.


Sorry but what is the actual point of this blog post?

GPG is just one guy. Who's practically beggared himself writing and maintaining the tool.

GPG is actually used by human rights activists, journalists etc. That, right there, is reason enough to celebrate it and NOT "kill it off".

I think the massive pile-on this is creating is really dumb, to be honest. So Moxie thinks it could be done better; that's great. He's good enough that he can "show, not tell".

Why waste time denigrating a project that's basically a labour of love for one guy that is actually tremendously important, even if it's "90's technology"? Old doesn't necessarily mean bad.


> Why waste time denigrating a project that's basically a labour of love for one guy that is actually tremendously important, even if it's "90's technology"? Old doesn't necessarily mean bad.

In the world of crypto, where we've learned so much, yes old means bad. Almost always.

Why denigrate GPG? Unfortunately, because the message that it's not good isn't being widely heard.

How many NEW crypto projects are being created that start out by saying, "first we will use GPG"? I've seen lots. OK, you failed right there, right at the start. Don't do that.

How many crypto geeks STILL spout rubbish about how the PKI is totally busted and the web of trust is the future? Way too many. WoT is sort of like the year of desktop Linux by now. It's just a bad joke that too many people won't let go of.

The most serious and effective applied cryptographers I know about are all ignoring GPG and rolling new modern crypto protocols. I feel the same way as Moxie - if you build a product based on GPG then almost immediately you are less interesting than a project that's doing something new.

And FWIW I have the same sinking feeling when I get a GPG encrypted email. Sometimes I don't read it immediately, I put it off. Sometimes I have to put it off because I'm not near my laptop. And when I decrypt it, inevitably I discover that I could have guessed the contents of the mail from the subject line and identity of the sender. The encryption was largely pointless to begin with.

The future of encrypted messaging is not GPG. We need to collectively let it go.


It's not about GPG sucking; it's about the absence of anything sucking less than GPG.

It's not about activists and journalists being (more or less) able to use GPG; it's about the fact that nobody who doesn't face as deadly a risk as them would bother to use GPG.

I didn't feel any denigration reading him; rather, the statement that:

* We have new crypto needs, in wake of revelations such as Snowden's;

* GPG isn't an adequate answer to those needs, and isn't likely to evolve into one;

* Tech people don't realize that GPG is unlikely to morph into an adequate solution, and therefore don't bother starting an alternative.

Finally, I believe that a successful answer would rely on excellent UX and PR at least as much as sound crypto. I'm not aware that Moxie is an expert in these fields (although he might have more talents than I know), so it's not obvious that he's in a position of showing rather than telling.


We need very good quality encryption software that is not really hard to use.

PGP is impossible for many people to use correctly. That means there is a bunch of -- often insecure -- software to fill the gap.

So the people who really need PGP/GPG either have to struggle to use it without knowing whether they've managed it correctly, or they use some other software instead that probably doesn't protect them.

