hack_ml's comments

hack_ml · on March 6, 2025

You will have to send one page at a time; most of this work has to be done via RAG. Adding a large context (like a whole PDF) still does not work that well in my experience.
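A minimal sketch of the page-at-a-time idea, assuming the PDF has already been split into per-page strings. Plain keyword overlap stands in for a real embedding-based retriever here, and `tokenize`, `score`, and `top_pages` are hypothetical helper names, not part of any library:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query_tokens, page_text):
    """Count how often the distinct query tokens occur on a page."""
    counts = Counter(tokenize(page_text))
    return sum(counts[token] for token in set(query_tokens))


def top_pages(pages, query, k=2):
    """Return indices of up to k pages that best match the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        range(len(pages)),
        key=lambda i: score(query_tokens, pages[i]),
        reverse=True,
    )
    # Drop pages that share no tokens with the query at all.
    return [i for i in ranked[:k] if score(query_tokens, pages[i]) > 0]


# Illustrative pages; in practice these would come from a PDF extractor.
pages = [
    "Introduction and company overview.",
    "Revenue grew in the third quarter; revenue details follow.",
    "Appendix: glossary of terms.",
]

question = "third quarter revenue"
for i in top_pages(pages, question, k=1):
    # Only the selected page goes into the prompt, not the whole document.
    prompt = f"Answer using only this page:\n{pages[i]}\n\nQuestion: {question}"
```

Only the highest-scoring pages are placed in the prompt, which keeps the context small instead of passing the whole document.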

hack_ml · on Oct 16, 2024

The ablation studies and the dataset can be found here: https://www.zyphra.com/post/building-zyda-2

hack_ml · on June 3, 2024

There are some integrations for stuff like https://docs.rapids.ai/visualization :

HoloViews, hvPlot, Datashader, Plotly, Bokeh, Seaborn, Panel, PyDeck, cuxfilter, node RAPIDS

hack_ml · on June 3, 2024

There is dask-cudf, which gets you a lot of the way there.

https://docs.rapids.ai/api/dask-cudf/stable/
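As a rough sketch of what that looks like in practice, assuming a RAPIDS install and an NVIDIA GPU: the function below reads a set of CSV files as a partitioned GPU dataframe and runs a grouped sum. The function name, the file pattern, and the column names are all placeholders for illustration.

```python
def sum_by_key(csv_glob, key_col, value_col):
    """Grouped sum over CSV files that may not fit in GPU memory at once."""
    # Imported lazily: dask_cudf needs a RAPIDS install and an NVIDIA GPU.
    import dask_cudf

    # Builds a lazy, partitioned dataframe; nothing is loaded yet.
    ddf = dask_cudf.read_csv(csv_glob)

    # The groupby is also lazy; it is planned per partition.
    result = ddf.groupby(key_col)[value_col].sum()

    # compute() runs the task graph and returns a cuDF Series.
    return result.compute()
```

A call might look like `sum_by_key("data/part-*.csv", "user_id", "amount")`, with the path and columns swapped for your own.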

ashvardanian · on June 3, 2024

Last time I used it, Dask was a lot worse than simple manual batching.
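For reference, "manual batching" here can be as simple as the sketch below: pull fixed-size chunks from an iterator, process each one, and combine the partial results. This is an illustration in plain Python, not the commenter's actual code; in a GPU workflow each batch would typically be a chunk of a file loaded into device memory.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def total_in_batches(values, batch_size):
    """Sum values one batch at a time, keeping only one batch in memory."""
    total = 0
    for batch in batched(values, batch_size):
        # Stand-in for the real per-batch work (e.g. a GPU kernel or cuDF op).
        total += sum(batch)
    return total
```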

3abiton · on June 3, 2024

This is huge, this was my only gripe with cudf!

xs83 · on June 3, 2024

I did a conversion of 500GB of data using dask_cudf on a GTX 1060 with 6GB of VRAM and was able to do it faster than a 20-node m3.xlarge cluster.

What you can do on even consumer GPUs is mind-blowing.

iamcreasy · on June 3, 2024

How does it perform when it comes to plotting datasets this large? Can I use matplotlib?

hack_ml · on May 13, 2024

I was conversing with it in Hinglish (a combination of Hindi and English that folks in urban India use), and it was pretty on point apart from some use of esoteric Hindi words, but I think we can fix that with the right prompting.

hack_ml · on Feb 27, 2024

Nvidia announces Nemotron-4 15B

We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.

hack_ml · on May 11, 2022

It's seamless to accelerate BERTopic on GPUs with cuML as of the latest release (v0.10.0).

Check out the docs at: https://maartengr.github.io/BERTopic/faq.html#can-i-use-the-...

All you need to do is the following:

    from bertopic import BERTopic
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP

    # Create instances of GPU-accelerated UMAP and HDBSCAN
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

    # Pass the above models to be used in BERTopic
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    # docs is your own list of document strings (not defined in this snippet)
    topics, probs = topic_model.fit_transform(docs)

hack_ml · on Nov 18, 2021

On the SQL front, there has been some active work to make that experience better with Dask.

See dask-sql: https://dask-sql.readthedocs.io/en/latest/pages/api.html
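A rough sketch of how dask-sql is typically used, assuming `dask_sql` is installed and you already have a Dask dataframe: register the dataframe as a named table in a `Context`, then query it with SQL. The function name and the default table name `events` are placeholders.

```python
def query_with_sql(ddf, sql, table_name="events"):
    """Run a SQL query against a Dask dataframe via dask-sql."""
    # Imported lazily so this module loads even without dask-sql installed.
    from dask_sql import Context

    context = Context()

    # Make the dataframe visible to SQL under table_name.
    context.create_table(table_name, ddf)

    # sql() returns a lazy Dask dataframe; compute() materializes it.
    return context.sql(sql).compute()
```

For example, `query_with_sql(ddf, "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id")`, where the column names are hypothetical.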

hack_ml · on Nov 18, 2021

You can probably use https://github.com/rapidsai/cudf/tree/main/python/dask_cudf, a Dask wrapper around cuDF.

hack_ml · on Nov 1, 2021

RAPIDS by NVIDIA has an open-source library with a scikit-learn-equivalent API, https://docs.rapids.ai/api/cuml/stable/, which seems to offer a 100x speedup for a lot of these models.