topics / labels #218

text2sql · 2023-09-26T13:36:03Z

I could not find in the documentation how labels / topics are created. Automatically? Based on the indexed_field? Can you please point me to the right source? Or should I create a separate field, labeling each 'text'? Thanks for your great product!

AndriyMulyar · 2023-09-26T13:45:28Z

Hi @text2sql, please check out the 'Reading an Atlas Map' section of the documentation:
https://docs.nomic.ai/how_does_atlas_work.html#reading-an-atlas-map

The topic labels are automatically generated from the textual contents of either the field you index on or a field that you specify as the topic_label_field (see the create_index or map_text/map_embeddings documentation).

text2sql · 2023-09-26T13:54:17Z

Thanks Andriy, I think I understand the concept. I am pulling vectors from Pinecone. Please see attached an example of a vector. If I pass 'Metadata' as topic_label_field, as in atlas.map_embeddings(embeddings=embeddings, data=[{'id': id} for id in ids], id_field='id', topic_label_field='Metadata'), will it label the topics 1, 2, 3, etc., or come up with the most relevant topic name?

I came across this article https://andriymulyar.com/blog/bert-document-classification , you were a couple of steps ahead :-)

AndriyMulyar · 2023-09-26T13:59:42Z

Wow, that is an old blog post :)

Yes, that's the right syntax. You also want to tell Atlas to build topics on your data.

atlas.map_embeddings(
    embeddings=embeddings,
    data=[{'id': id} for id in ids],
    id_field='id',
    build_topic_model=True,
    topic_label_field='Metadata'
)

It's possible to run Atlas topic modeling on your data without auto-generating the labels that float over the map, so that's why there are two parameters.

text2sql · 2023-09-26T14:01:40Z

Don't I need indexed_field too?

text2sql · 2023-09-26T14:03:34Z

I get the same error whether I use 'Metadata' or 'text' in topic_label_field.

text2sql · 2023-09-26T14:06:56Z

rguo123 · 2023-09-26T14:52:12Z

Hey @text2sql! Hopping in the discussion here. Looks like the issue is that the topic label field isn't being uploaded as part of your data (right now data just contains id).

Need something like (pseudocode below):

atlas.map_embeddings(
    embeddings=embeddings,
    # zip pairs each id with its text so every record carries the label field
    data=[{'id': id, 'text': text} for id, text in zip(ids, texts)],
    id_field='id',
    build_topic_model=True,
    topic_label_field='text'
)
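
To illustrate the point above (the field named in topic_label_field has to be present in every uploaded record), here is a minimal sketch in plain Python. The helper names build_records and missing_field are hypothetical and not part of the nomic client; they only check the data list before it is passed to atlas.map_embeddings.

```python
def build_records(ids, texts, label_field="text"):
    """Pair each id with its text so the label field travels with the data."""
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    return [{"id": str(i), label_field: t} for i, t in zip(ids, texts)]


def missing_field(records, field):
    """Return the ids of records that have no non-empty value for `field`."""
    return [r["id"] for r in records if not r.get(field)]


# Hypothetical sample data, for illustration only.
records = build_records(["0", "1"], ["school disputes property tax", "contract breach"])

# Every record carries 'text', so nothing is reported as missing.
print(missing_field(records, "text"))

# An id-only record (the situation earlier in this thread) is flagged.
print(missing_field([{"id": "0"}], "text"))
```

If missing_field returns any ids for the field you plan to use as topic_label_field, the upload would be lacking the text the labels are generated from.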

text2sql · 2023-09-26T15:08:55Z

Thank you. I used the code below and it seemed to work. Kinda :-) If you read some of the case facts you will see the labeling is off. Does it improve if I upload more vectors, say 10K or 100K?
Can I somehow prime / prompt the labeling model?

https://atlas.nomic.ai/map/26df09cc-9f48-42ef-9557-2f56a6fd18be/8be0f3c2-9b7f-43bb-b7e3-dcabf53346df

import numpy as np
from nomic import atlas

# `index` is an existing Pinecone index handle, created elsewhere.
num_embeddings = 1000

# Fetch the first 1000 embeddings
vectors = index.fetch(ids=[str(i) for i in range(num_embeddings)])

# Initialize lists to store IDs, embeddings, and texts
ids = []
embeddings = []
texts = []

# Extract IDs, embeddings, and texts from the fetched data
for id, vector in vectors['vectors'].items():
    ids.append(id)
    embeddings.append(vector['values'])
    texts.append(vector['metadata']['text'])  # Extracting text from METADATA

# Convert embeddings to a numpy array
embeddings = np.array(embeddings)

# Check if texts and ids are of the same length
assert len(texts) == len(ids), "Mismatch in lengths of ids and texts"

# Map embeddings to Nomic Atlas
atlas.map_embeddings(
    embeddings=embeddings,
    data=[{'id': id_val, 'text': text_val} for id_val, text_val in zip(ids, texts)],
    id_field='id',
    build_topic_model=True,
    topic_label_field='text'
)

text2sql · 2023-09-26T15:10:52Z

Sorry, forgot to ask: do you label each vector with only one label? E.g. if a school argues over property tax, will it be labeled as "school" or "taxation" or both?

rguo123 · 2023-09-26T15:23:53Z

More vectors should improve the labels, as they give the model more data to generate the topics from. I'd definitely give both 10k and 100k a try!

Right now, we don't support user-side prompting/priming for the topic model.
Each embedding gets assigned one topic label, but each topic label contains a longer description that you can fetch and look into:

from nomic import AtlasProject
map = AtlasProject(name='My Project').maps[0]
map.topics.metadata

text2sql · 2023-09-26T15:24:56Z

seems like i am limited by 1k only
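
The thread does not say whether this 1k limit comes from the Pinecone fetch call or from the Atlas upload. Assuming it is a per-request cap on the fetch, a minimal sketch of splitting a longer id list into batches could look like this. The chunked helper and the batch size of 1000 are assumptions for illustration, not documented limits.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical id range; the real ids depend on how the index was populated.
all_ids = [str(i) for i in range(2500)]
batches = list(chunked(all_ids, 1000))
print([len(batch) for batch in batches])  # prints [1000, 1000, 500]

# Each batch could then be fetched separately and the results merged, e.g.:
#     for batch in chunked(all_ids, 1000):
#         vectors = index.fetch(ids=batch)
```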

text2sql · 2023-09-26T15:29:58Z

Sorry for the unlimited number of questions today... Since you mentioned that more vectors are better, does that mean you use some equivalent of KNN to group the points and then label the group rather than each point individually?
