Hacker News | hack_ml's comments

You will have to send one page at a time; most of this work has to be done via RAG. Adding a large context (like a whole PDF) still does not work that well, in my experience.
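As a rough illustration of the page-at-a-time idea, here is a minimal sketch that scores each page against a question and keeps only the best match to send to the model. The helper names (`tokenize`, `score`, `top_pages`) and the sample pages are hypothetical, and the word-overlap score is a toy stand-in for whatever embedding-based retriever a real RAG setup would use.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, page):
    """Toy relevance score: number of token occurrences shared by query and page."""
    q = Counter(tokenize(query))
    p = Counter(tokenize(page))
    return sum(min(q[t], p[t]) for t in q)

def top_pages(query, pages, k=1):
    """Return the k best-scoring (page_number, text) pairs, with pages numbered from 1."""
    ranked = sorted(
        enumerate(pages, start=1),
        key=lambda item: score(query, item[1]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical pages, e.g. text extracted from a PDF one page at a time.
pages = [
    "Introduction and overview of the quarterly report.",
    "Revenue grew in the third quarter, driven by GPU sales.",
    "Appendix: glossary of terms.",
]

# Only the retrieved page(s) would be placed in the prompt, not the whole PDF.
best = top_pages("What drove revenue in the third quarter?", pages, k=1)
```

Swapping `score` for an embedding similarity would keep the same shape: rank pages, send only the top few.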


The ablation studies and the dataset can be found here: https://www.zyphra.com/post/building-zyda-2


There are some visualization integrations, listed at https://docs.rapids.ai/visualization :

HoloViews, hvPlot, Datashader, Plotly, Bokeh, Seaborn, Panel, PyDeck, cuxfilter, node RAPIDS


There is dask-cudf, which gets a lot of the way there.

https://docs.rapids.ai/api/dask-cudf/stable/


Last time I used it, Dask was a lot worse than simple manual batching.
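For readers unsure what "simple manual batching" might look like, here is a minimal sketch: split the input into fixed-size chunks and process each chunk in turn, so only one chunk is in memory at a time. The names `batched` and `process_in_batches` are hypothetical and not taken from any library discussed here.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process_in_batches(rows, batch_size, fn):
    """Apply fn to each batch and collect the per-batch results."""
    return [fn(batch) for batch in batched(rows, batch_size)]

# Sum ten numbers in batches of four; the last batch is smaller.
totals = process_in_batches(range(10), 4, sum)
```

In practice `fn` would be the per-chunk work (for example, a GPU dataframe operation), and the batch size would be chosen to fit in device memory.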


This is huge; this was my only gripe with cuDF!


I did a conversion of 500GB of data using dask_cudf on a GTX 1060 with 6GB of VRAM and was able to do it faster than a 20-node m3.xlarge cluster.

What you can do on even consumer GPUs is mind-blowing.
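A conversion like this might be sketched as below. The file formats (CSV in, Parquet out), the function name `convert_csv_to_parquet`, and the `blocksize` value are all assumptions for illustration; the comment does not say what was converted. It relies on `dask_cudf.read_csv` and the Dask dataframe `to_parquet` method, and needs a RAPIDS install and an NVIDIA GPU to actually run.

```python
def convert_csv_to_parquet(src_glob, dst_dir, blocksize="256MB"):
    """Read CSV files on the GPU in partitions and write them out as Parquet."""
    # Imported lazily so this module can be imported on a machine without RAPIDS.
    import dask_cudf

    # Each partition covers roughly `blocksize` bytes of input, so the full
    # dataset never has to fit in GPU memory at once (assumed value; tune it
    # to the available VRAM).
    ddf = dask_cudf.read_csv(src_glob, blocksize=blocksize)

    # Writes one Parquet file per partition into dst_dir.
    ddf.to_parquet(dst_dir)
```

A call might look like `convert_csv_to_parquet("data/*.csv", "out_parquet/")`, where both paths are placeholders.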


How does it perform when it comes to plotting these large data points? Can I use matplotlib?


I was conversing with it in Hinglish (a combination of Hindi and English), which folks in urban India use, and it was pretty on point apart from some use of esoteric Hindi words, but I think with the right prompting we can fix that.


Nvidia announces Nemotron-4 15B

We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.


It's now seamless to accelerate BERTopic on GPUs with cuML, as of the latest release (v0.10.0).

Check out the docs at: https://maartengr.github.io/BERTopic/faq.html#can-i-use-the-...

All you need to do is the following:

    from bertopic import BERTopic
    from cuml.cluster import HDBSCAN
    from cuml.manifold import UMAP

    # Create instances of GPU-accelerated UMAP and HDBSCAN
    umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
    hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)

    # Pass the above models to be used in BERTopic
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    # `docs` is assumed to be a list of document strings defined earlier
    topics, probs = topic_model.fit_transform(docs)


On the SQL front, there has been some active work to make that experience better with Dask.

See dask-sql: https://dask-sql.readthedocs.io/en/latest/pages/api.html


You can probably use dask_cudf (https://github.com/rapidsai/cudf/tree/main/python/dask_cudf), a Dask wrapper around cuDF.


RAPIDS by NVIDIA has cuML, an open-source library with an API equivalent to scikit-learn's (https://docs.rapids.ai/api/cuml/stable/), which seems to offer a 100x speedup for a lot of these models.

