
narush · on May 23, 2024

Check my comment to OP on this thread - hopefully this answers why we don't do this!

I will say: we are considering some more approaches from traditional compilation / transpilation. I think these are very compatible!

narush · on May 23, 2024

Copying and pasting from a comment below -- we considered this heavily during our MVP, but in practice "moving an Excel process to Python" does not just mean executing an Excel file through Python. This pretty much doesn't replace your Excel dependence at all.

Consider the following (pretty easy) translation of a simple table:

    # This is not a useful or usable artifact; you're still trapped in Excel
    # but things are just worse now, since it's not on a 2d grid
    A1 = "Prices"
    A2 = 1
    A3 = 2
    B1 = "With Tax"
    B2 = SUM(A2, 10) * 1.3
    B3 = SUM(A3, 10) * 1.3

But this is ultimately an awful solution for the user:

1. It’s 100% impossible to read. A large Excel file often has 100k+ formulas (many of them with shared structures). This is 100k+ lines of code...

2. It’s impossible to maintain. Yeah, since it lacks all semantic structure, there’s no f* way you’re going to test it or modify it.

To make it really concrete: you can't just transpile a SUMIF or VLOOKUP to a Python implementation of SUMIF or VLOOKUP, and you absolutely can't do this on a cell-by-cell basis.

Rather: we're trying to generate a Python script that appears to be written by an expert developer. To do this, you have to be willing to ditch the Excel formulas / execution engine, do more abstract reasoning over the file (like identifying tables / consistent formulas in columns and translating them as pandas dataframes), and translate them without just relying on matching the Excel exactly.

You want something closer to this:

    import pandas as pd

    df = pd.DataFrame({'Prices': [1, 2]})
    df['With Tax'] = (df['Prices'] + 10) * 1.3

We want parity of outputs, not parity in how we get there!
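To picture the "consistent formulas in columns" step mentioned above, here is a minimal sketch. It is an assumption about one way this could work, not a description of Pyoneer's internals. It rewrites each simple A1-style reference as an offset from the formula's own row, so a column counts as "one formula" when every row reduces to the same template. The names `normalize` and `shared_template` are made up for illustration, and the regex ignores sheet names, `$` anchors, and ranges.

```python
import re
from typing import Dict, Optional

# Matches simple A1-style references such as A2 or BC17.
CELL_REF = re.compile(r"\b([A-Z]{1,3})([0-9]+)\b")


def normalize(formula: str, row: int) -> str:
    """Rewrite each reference as a row offset relative to the formula's own row."""
    def to_offset(match: re.Match) -> str:
        column, ref_row = match.group(1), int(match.group(2))
        return f"{column}[{ref_row - row:+d}]"

    return CELL_REF.sub(to_offset, formula)


def shared_template(formulas_by_row: Dict[int, str]) -> Optional[str]:
    """Return the common template if every row uses the same relative formula."""
    templates = {normalize(f, row) for row, f in formulas_by_row.items()}
    return templates.pop() if len(templates) == 1 else None


# Hypothetical column: each row refers to the cell to its left.
column_b = {2: "SUM(A2, 10) * 1.3", 3: "SUM(A3, 10) * 1.3"}
print(shared_template(column_b))  # SUM(A[+0], 10) * 1.3
```

Once a column collapses to a single template like this, it can be translated once as a dataframe column expression instead of once per cell.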

jkaptur · on May 23, 2024

> We want parity of outputs, not parity in how we get there!

To be clear, though, you don't necessarily have parity of outputs.

It seems strange to me to sacrifice correctness for readable output. I would prefer a deterministic strategy that is always correct and sometimes readable. You could do that by generating an intermediate structure A1=... A2=..., then applying heuristics to say "hey, this enormous column of VLOOKUPs is actually a join", and so on. Maybe LLMs could advise on that, but I'm not sure how you'd check their work...

... Anyway you're the person "in the arena", having actually created something, so well done!

narush · on May 23, 2024

> To be clear, though, you don't necessarily have parity of outputs.

The cool thing is that the Excel file is both the programmatic specification of the process and the actual output data you want. We can check parity of outputs by comparing the data we create with Python to the data in Excel - in practice, Pyoneer generates test cases for tables that do exactly this, even when we can't translate every formula correctly!
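A rough sketch of what such a parity test could look like for the toy table from earlier in the thread. This is an illustration, not the test code Pyoneer actually generates. The function name `build_prices_table` is hypothetical, and the expected values are hardcoded here where a real test would read the values Excel had already computed from the workbook.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def build_prices_table() -> pd.DataFrame:
    """The translated Python logic for the toy table."""
    df = pd.DataFrame({"Prices": [1, 2]})
    df["With Tax"] = (df["Prices"] + 10) * 1.3
    return df


# Assumption: these stand in for the cached cell values stored in the Excel file.
expected_from_excel = pd.DataFrame({"Prices": [1, 2], "With Tax": [14.3, 15.6]})


def test_prices_table_matches_excel() -> None:
    # By default, float columns are compared with a small tolerance.
    assert_frame_equal(build_prices_table(), expected_from_excel)


test_prices_table_matches_excel()
```

The test only checks outputs, so it passes no matter how the Python arrives at the numbers.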

> applying heuristics to say "hey, this enormous column of VLOOKUPs is actually a join", and so on.

We do this deterministically currently. The only non-deterministic aspect is formula translation - where we defer to some LLM. Structurally, everything is deterministic though - and here we really do aim for readability (there's a lot more to do here though).
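To make the "column of VLOOKUPs is actually a join" idea concrete, here is a minimal pandas sketch. The tables and column names are hypothetical, and this is not Pyoneer's actual output.

```python
import pandas as pd

# Hypothetical sheets: an orders table and a lookup table of unit prices.
orders = pd.DataFrame({"sku": ["a", "b", "a", "c"], "qty": [2, 1, 5, 3]})
prices = pd.DataFrame({"sku": ["a", "b"], "unit_price": [10.0, 4.0]})

# Roughly what a column of exact-match VLOOKUPs into the prices table does:
# a left join keeps every order row, and SKUs with no match come back as NaN
# (where Excel would show #N/A). validate= raises if a SKU appears twice in
# the lookup table, a case where VLOOKUP would silently take the first match.
joined = orders.merge(prices, on="sku", how="left", validate="many_to_one")
joined["total"] = joined["qty"] * joined["unit_price"]
print(joined)
```

The whole column becomes one `merge` call, and the unmatched row stays visible as a missing value.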

bko · on May 23, 2024

> we're trying to generate a Python script that appears to be written by an expert developer.

But you're not. You're just deferring to an LLM. Try to create some abstractions that can work. For instance, there are libraries that unuglify code. Better to start from correct Python and build from that than to do the whole shebang with an LLM black box.

In your case of 100k formulas, verifying that the output is correct would be a monumental task, especially if it's not deterministic and you can't have library tests. And the human-readable format is much harder to reason about and validate. And getting an answer for a single input/output pair is prone to overfitting, especially if you do it recursively to have it fit errors.

No offense, but I can't imagine anyone using this for anything serious.

narush · on May 23, 2024

Thanks for the link! We looked into existing libraries for Excel formula execution - this library is an awesome example!

We considered using this, but to copy from our MVP spec:

The easiest thing to do is to replicate Excel’s execution engine - you can see someone who has done this [here](https://pypi.org/project/xlcalculator/). But just evaluating Excel formulas is not what users want when transitioning a process to Python - they want to be able to ditch Excel (mostly).

The next easiest thing would be to transpile Excel formulas to the following format:

    # This is not a useful or usable artifact; you're still trapped in Excel
    # but things are just worse now, since it's not on a 2d grid
    A1 = "Prices"
    A2 = 1
    A3 = 2
    B1 = "With Tax"
    B2 = SUM(A2, 10) * 1.3
    B3 = SUM(A3, 10) * 1.3

But this is ultimately an awful solution for the user:
1. It’s 100% impossible to read. A large Excel file often has 100k+ formulas (many of them with shared structures). This is 100k+ lines of code...

2. It’s impossible to maintain. Yeah, since it lacks all semantic structure, there’s no f** way you’re going to test it or modify it.

In general: we're trying to generate a Python script that appears to be written by an expert developer. To do this, you have to be willing to ditch the Excel formulas / execution engine.

narush · on May 23, 2024

You 100% own all Python code that you download from Pyoneer. Sorry for the lack of clarity here!

You're welcome to:

1. Edit it to fix up places where it can't translate fully.

2. Send it to your colleagues and tell them you wrote it (although... for your own personal morals... maybe don't :) )

3. Upload it to your company's GitHub.

4. Whatever the hell else you want.

To be very clear: this is your code!

The code we generate currently has no dependencies on anything other than pandas, numpy, the Python core library, and the Excel file you uploaded in the first place. This might change as we try and support more Excel features (and so need to do some pivot table mocking), but we're not aiming to lock folks in!

P.S. If you do anything particularly interesting with the code you generate... tell me about it. nate @ sagacollab.com. I'd love to hear.

P.P.S. I'll update the language on this ASAP.

narush · on May 23, 2024

Totally on the roadmap, but not sure when yet!

The problem is data gets into these mega-excels through all sorts of funky routes... and I really do mean funky :)

1. PowerQuery: this is defined statically in the workbook, so it is detectable by Pyoneer. But I don't know a ton about the integration in Python here. I imagine this is doable.

2. Manual data entry: Pyoneer can't really detect this from a static Excel file - what's the difference between static data in the sheet and data that gets updated every time? Oftentimes, users with a lot of manual data entry want to "automate this in Python" by turning the Excel file into something like a form. Generating a proper web app out of the Excel file would be pretty sweet!

3. Database output copied - aka, copy in a table. This one is sometimes pretty crazy - I've seen Excel workbooks that have SQL queries just copied and pasted into a random cell in the workbook, so you can copy that and run it on some archaic SQL server. And then copy the output back in...

4. Macros: these run an API call, or an SQL query, or pull (and then format) data from another Excel sheet, and then put it in the right place. This then requires translating macros - which are a whole programming language of their own. This is actually pretty high-priority for us right now, based on early feedback from developers who are in the thick of it with big Excel files.

5. Custom plugins. Big finance shops build/buy plugins that pull in data all the time! We haven't really started investigating how to handle these.

6. Other workbooks: at large banks, there's an additional dependency graph of workbooks that rely on each other across the org. It's epic. There's a single workbook that defines all market holidays, that's used for all Excel files that do performance reporting. And then these performance reports feed into other Excel files (by way of direct references, but also by way of copy and pasting, but also by way of uploading/downloading through a database). Supporting multiple Excel files at once is something we'll have to tackle eventually!
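For the cross-workbook case, here is a small sketch of how a dependency graph of workbooks could be ordered, using the standard library's `graphlib`. The workbook names are hypothetical, and the idea of converting workbooks in dependency order is an assumption, not something Pyoneer does today.

```python
from graphlib import TopologicalSorter

# Hypothetical workbooks; each key depends on the workbooks in its set.
depends_on = {
    "market_holidays.xlsx": set(),
    "performance_report.xlsx": {"market_holidays.xlsx"},
    "board_summary.xlsx": {"performance_report.xlsx"},
}

# static_order() yields each workbook only after everything it depends on,
# and raises graphlib.CycleError if two workbooks reference each other.
conversion_order = list(TopologicalSorter(depends_on).static_order())
print(conversion_order)
```

The harder part in practice is building `depends_on` at all, since copy-paste and database round trips leave no reference in the file to detect.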

So... there's a lot to do here. We're really early - so we're focused on two primary things right now:

1. Solving the most pressing pain points first. Hence the early launch so we can talk to more folks and prioritize better. I've got a reasonable idea since I've done so much of this work myself, but every finance shop does things differently...

2. Leaving good TODOs when we can't translate something. Currently, we can't translate pivot tables or complex formulas -- but we generate TODOs for these so you can go back and fill them in with the Python skills you have (and maybe the help of ChatGPT).

We're aiming to just give you a Python script. So if we don't translate the data pull how you want... you can just edit the script :)

narush · on May 23, 2024

Currently:

1. Data remains stored in the Excel file. The generated script pulls the raw data directly from the workbook - but it's a single read_xlsx call. So if you want to switch it out for an API call, db read, whatever - it's easy to do so.

2. We model data as primitive Python data types, or, if it's a table, as a pandas dataframe.

Currently, we detect at most one table per sheet, and it's gotta be contiguous. These are pretty huge limitations we'll be relaxing soon -- but we wanted to get something out as soon as it would have been useful to one person -- and in its current state, this would have helped me with some of my larger Excel automation projects :)
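To illustrate point 1 above (the data pull being a single call that is easy to swap out), here is a minimal sketch. It is not the script Pyoneer emits. The function names and the `prices.xlsx` file name are hypothetical, and pandas' `read_excel` stands in for the read call mentioned above.

```python
from typing import Callable

import pandas as pd


def load_from_excel(path: str = "prices.xlsx") -> pd.DataFrame:
    """Read the raw data from the workbook (hypothetical file name)."""
    return pd.read_excel(path, sheet_name=0)


def add_tax(load: Callable[[], pd.DataFrame] = load_from_excel) -> pd.DataFrame:
    """The business logic only sees a dataframe, never the source itself."""
    df = load()
    df["With Tax"] = (df["Prices"] + 10) * 1.3
    return df


# Swapping the source means passing a different loader, e.g. one that queries
# a database or an API. An in-memory frame stands in for that here.
result = add_tax(load=lambda: pd.DataFrame({"Prices": [1, 2]}))
print(result)
```

Keeping the read behind one function means the rest of the script does not change when the data moves out of Excel.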

narush · on May 23, 2024

Yeah, we built this because we wished we had it! I've spent literally thousands of hours reimplementing Excel workbooks in Python as support for our previous shot at this problem - which was a spreadsheet that generates Python code as you edit it.

narush · on May 23, 2024

> What customer discovery have you done so far?

My two cofounders and I spent the past 4 years working on Mito (https://trymito.io) -- where our customers are primarily large finance shops (including some bulge bracket banks you've heard of) that have a really concrete goal of getting users out of Excel and into Python. It's not every finance shop, but quite a few are trying to make this transition. This usually means: multi-day Python trainings, a Python support team, and a few developers whose semi-full-time job is helping transition existing Excel processes to Python.

We built Mito to be a tool for the Excel-first users - we tried to make it easier for them to use their existing spreadsheet skills to write Python. But in working with the developers that support these new Python users, it became clear to us that there's a big pain point around:

1. I'm a dev who was given a big, old Excel file

2. It has a lot of business logic in it, understood by the person who made it, but not by me - who is tasked with turning it into real software

3. I have to spend 100s of hours: trying to understand the file, faithfully replicating the logic, and testing for consistency - to convert this to a Python process.

I personally have been this developer in quite a few cases - just as support for Mito and helping these Excel users trying to transition to Python. Some Excel files literally take 300+ hours to "rebuild from scratch" in Python. It's often very engaging work, but brutally slow - so we're trying to automate as much as we can with Pyoneer.

> There's a vanishing window for stuff like this, if you're a Microsoft shop like 99% of the corporate world I think you are turning those excel files into power apps and powerBI dashboards, before you are hiring python devs.

I think this is a really fair point! We're not sure exactly what a reasonable business model really looks like, long-term. Right now, we're really focused on finding the developers for whom this is a big pain point, and seeing what we need to prioritize to make their lives better. I'm one of those developers...

kingkongjaffa · on May 23, 2024

Hey, thanks for the reply!

My next question would be how did Mito go?

What MMR did you get to?

How strong was the PMF? 4 years seems like a long time to test this product, and I'm not sure if the overall market for this is too small. Is it the trap of dogfooding (building the thing you wish you had) without sizing the market?

Mito site says: Trusted by dozens of Fortune 500 companies. How penetrated is that really? Is it one dev in each company on the free tier, or entire departments/teams using this on the $150/month/user plan?

I built something similar (Excel / Python space), but it was really just one feature as part of a larger platform, not something I would build a company around.

narush · on May 23, 2024

Sure thing - thanks for the good thoughts!

We're still working on Mito - it's not a retired product by any means. Pyoneer is just another stab at the same problem for a different user group.

MMR-wise, we scaled to profitability. PMF-wise, we have not reached this. We have large customers who make up the bulk of our revenue who love the product, and use it quite effectively as the basis for their entire Python program, but, transparently, scaling is hard!

> Is it the trap of dogfooding (building the thing you wish you had) without sizing the market?

Very possibly. But I think this is a much bigger pain point at large orgs with legacy processes than you realize. Every large bank has entire development teams that are tasked with transitioning legacy processes out of spreadsheets. We're aiming to improve the efficiency of these developers dramatically.

For some of the spreadsheets I've personally automated, I think this would take 300 hours of work and make it like 5...

narush · on April 4, 2024

This is an excellent blog post - I'd never heard of Great Tables before, and I'm a newly minted fan!

> confronted with an all-too-familiar dilemma: copy your data into a tool like Excel to make the table, or, display an otherwise unpolished table.

One add-on (coming from the past 4 years of working on a tabular-data-in-Python startup [1]) is that users aren't just copying data into Excel because of its good formatting capability: very often, there are organizational constraints that mean that Excel _needs_ to be where this data ends up.

The most common reasons I've seen for data ending up in Excel:

1. Other parts of the report rely on Excel features - you want to build pivot tables or graphs in Excel (often, these are much easier to build in Excel than in Python for anyone who isn't a real Pythonista).

2. The report you're sending out for display is _expected_ in an Excel format. The two main reasons for this are just organizational momentum, or that you want to let the receiver conduct additional ad-hoc analysis (Excel is best for this in almost every org).

The way we've sliced this problem space is by improving the interfaces that users can use to export formatting to Excel. You can see some of our (open-core) code here [2]. TL;DR: Mito gives you an interface in Jupyter that looks like a spreadsheet, where you can apply formatting like Excel (number formatting, conditional formatting, color formatting) - and then Mito automatically generates code that exports this formatting to an Excel file. This is one of our more compelling enterprise features for decision makers that work with non-expert Python programmers - getting formatting into Excel is a big hassle.

Of course, for folks who can ditch Excel entirely, this is entirely unnecessary. Great Tables seems excellent in this case (and anyone writing blog posts this good is probably writing good code too... :) )

[1] https://trymito.io

[2] https://github.com/mito-ds/mito/blob/dev/mitosheet/mitosheet...

NickFanion · on April 5, 2024

Playing nice with Excel (and PowerPoint) is an underrated feature. The next step I see from business users is taking the formatted Excel table and pasting it into a PowerPoint slide. The hacker mindset often says the Microsoft Office suite is the wrong tool for the job, so we should use X tool and Y process instead. That may be true, but there's so much institutional inertia at established organizations that it's hard to completely abandon the Office suite. Anything that lets a technical user do something programmatically, but allows the output to be easily manipulated by a non-expert is invaluable.

I've had success generating svg visuals and placing them in slides, which PPT treats as a "shape" (the Graphics Format ribbon appears), and business users like that they can modify the shapes (for example, change the color). Great Tables supports pdf export, but not svg. I just tested a pdf vector in the current version of PPT, and while it maintains the vector, PPT won't let me convert it to a shape (only the Picture Format ribbon is available). Great Tables doesn't seem to support svg export directly, so there needs to be an additional pdf -> svg conversion.

narush · on Feb 8, 2024

The OG Nick Bostrom paper [1] makes an argument for simulation theory with some reasonably simple math, using only a few variables:

- `f_p` - Fraction of all human-level technological civilizations that survive to reach a posthuman stage

- `f_I` - Fraction of posthuman civilizations that are interested in running ancestor-simulations

- `N_I` - Average number of ancestor-simulations run by a posthuman civilization

- `H` - Average number of individuals that have lived in a civilization before it reaches a posthuman stage

And then the formula for the fraction of observers with human-type experiences that live in simulations (after simplifying) is just: `f_sim = (f_p * f_I * N_I) / (f_p * f_I * N_I + 1)`.

By first arguing that `N_I` is likely to be very large (because, pretty much, why not) -- you can thus conclude that one of these three conditions is met:

1. `f_p ~= 0` -- or the human species is very likely to go extinct before reaching a “posthuman” stage;

2. `f_I ~= 0` -- or any posthuman civilization is extremely unlikely to run a significant number of simulations of their evolutionary history (or variations thereof);

3. `f_sim ~= 1` -- or we are almost certainly living in a computer simulation.
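To see how the three cases fall out of the simplified formula, here is a tiny sketch that just evaluates it. The input values are made up for illustration and do not come from the paper.

```python
def f_sim(f_p: float, f_i: float, n_i: float) -> float:
    """Fraction of human-type observers that live in simulations."""
    simulated = f_p * f_i * n_i
    return simulated / (simulated + 1)


# Illustrative inputs only.
print(f_sim(f_p=0.01, f_i=0.01, n_i=1e9))  # a huge N_I swamps small fractions
print(f_sim(f_p=0.0, f_i=0.5, n_i=1e9))    # f_p = 0 forces f_sim = 0
print(f_sim(f_p=0.5, f_i=0.0, n_i=1e9))    # f_I = 0 forces f_sim = 0
```

With `N_I` very large, the product `f_p * f_I * N_I` is either close to zero (case 1 or 2) or much bigger than 1, which pushes `f_sim` toward 1 (case 3).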

It all feels pretty similar to the Fermi paradox to me -- which I'm also suspicious of for reasons I can't justify properly. Something about point estimates for variables like `f_I` is... weird? Idk. I'm honestly not good enough at math to disagree - but it also feels like folks who are too good at math might be using `f_I` in equations in a way that isn't legitimate.

Like, assuming the "existence" of `f_I` as a concept to reason with -- doesn't it feel like more might be sneaking in with this assumption?

[1] https://simulation-argument.com

mshron · on Feb 8, 2024

One of Aaronson's arguments in the article boils down to the idea that running a full universe simulation (without cheating) on a universe with the same physics as ours may just not be possible computationally; it seems physically plausible that you need a universe to compute a universe.

If that's true, then the simulators would need to be running in a different kind of universe than ours... in which case "ancestor simulation" doesn't really make sense.

303uru · on Feb 8, 2024

Well, yes, quantum physics says that's true. Our universe is the minimum required to run our universe. There could be tricks: for instance, maybe you only simulate at fine grain the universe near an observer. But still, to simulate our granularity you need a simulation of equal (or greater) granularity. There's the trick: it's conceivable that the simulator exists in a universe with more dimensionality.

vagab0nd · on Feb 10, 2024

What if the simulation is the only thing in that universe?