df property of AtlasMapTopics does not include topic ids, only topic labels #188

Open
michael4tasman opened this issue Jun 29, 2023 · 4 comments

Comments

@michael4tasman

    @property
    def df(self) -> pandas.DataFrame:
        """
        A pandas dataframe associating each datapoint on your map to their topics as each topic depth.
        """
        return self.tb.to_pandas()

    @property
    def tb(self) -> pa.Table:
        """
        Pyarrow table associating each datapoint on the map to their Atlas assigned topics.
        This table is memmapped from the underlying files and is the most efficient way to
        access topic information.
        """
        return self._tb

print(topic_data.df[0:1])

      id topic_depth_1       topic_depth_2  topic_depth_3
0  18963  Music videos  Youtube, bitchute,  youtube video
@AndriyMulyar
Contributor

AndriyMulyar commented Jun 29, 2023 via email

@michael4tasman
Author

This raises a good question: would you prefer the IDs or the labels in the data frame representation?

Depends on the use case. If you want to find the labels for a given datum, then you want the labels in the data frame. If you want to find the datums matched to a given topic, then you want the IDs.

I think given that it is a data frame, there shouldn't be any obstacle to simply adding a column for the topic ID to the current representation? That way one representation can be used for either use case.
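A minimal sketch of what that combined representation could look like. The `topic_id_depth_1` column name, the ID values, and the extra rows are assumptions for illustration, not something the library currently returns:

```python
import pandas as pd

# Hypothetical frame: the label column mirrors the printed output above;
# the topic_id_depth_1 column and all values in it are assumed for illustration.
df = pd.DataFrame(
    {
        "id": [18963, 18964, 18965],
        "topic_depth_1": ["Music videos", "Music videos", "Web hosting"],
        "topic_id_depth_1": [1, 1, 2],
    }
)

# Use case 1: find the label for a given datum.
label = df.loc[df["id"] == 18963, "topic_depth_1"].iloc[0]

# Use case 2: find the datums matched to a given topic ID.
datum_ids = df.loc[df["topic_id_depth_1"] == 1, "id"].tolist()

print(label)
print(datum_ids)
```

With both columns present, either lookup is a one-line boolean filter on the same frame.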

Also, what do you think of the new state access patterns? Do they make sense?

There was a fair amount of inconsistency in the previous API. Exposing data frames as properties has the advantage of consistency and of leveraging the pandas ecosystem. It has the disadvantage of an additional learning hurdle for developers coming from plain Python without pandas experience. I think this can be mitigated with good step-by-step documentation, and I like the idea of redoing the documentation as .ipynb files.

@AndriyMulyar
Contributor

You opened a previous issue on 1.x about accessing datapoints by topic: #183

Are you still facing this?
You should be able to do:

from nomic import AtlasProject

project = AtlasProject(name='My Project')
map = project.maps[0]
print(map.topics.group_by_topic(3))

@michael4tasman
Author

michael4tasman commented Jun 30, 2023

No, I still get the same error as #183 using the 2.0.0 version of the library:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[23], line 3
      1 project = atlas.AtlasProject(project_id=PROJECT)
      2 map = project.maps[0]
----> 3 print(map.topics.group_by_topic(3))

File ~/Minds/projects/topical/nomic/lib/python3.10/site-packages/nomic/data_operations.py:257, in AtlasMapTopics.group_by_topic(self, topic_depth)
    254 result_dict = {}
    255 topic_metadata = topic_df[topic_df["topic_short_description"] == topic]
--> 257 subtopics = hierarchy[topic]
    258 result_dict["subtopics"] = subtopics
    259 result_dict["subtopic_ids"] = topic_df[topic_df["topic_short_description"].isin(subtopics)][
    260     "topic_id"
    261 ].tolist()

KeyError: 'Cloud and Server Hosting'
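The failing line indexes `hierarchy` directly with the topic label. A minimal sketch of that failure mode, assuming `hierarchy` is a plain dict that maps a parent topic label to its subtopic labels and has no entry for leaf topics. Both the structure and the "Web services" parent are assumptions, since the traceback does not show how `hierarchy` is built:

```python
# Hypothetical hierarchy: parent topic label -> list of subtopic labels.
# In this sketch a leaf topic has no entry of its own.
hierarchy = {"Web services": ["Cloud and Server Hosting"]}

topic = "Cloud and Server Hosting"

# Direct indexing raises KeyError for a topic that is not a key.
try:
    subtopics = hierarchy[topic]
except KeyError:
    subtopics = None

# dict.get with a default treats a missing key as "no subtopics".
safe_subtopics = hierarchy.get(topic, [])
```

If the real structure behaves this way, a default on the lookup would avoid the exception, but that is a guess from the traceback alone.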

However, with the topic id included in the data frame, as discussed above, the group_by_topic() function becomes redundant.
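A rough sketch of how the grouping could then be done with pandas alone, assuming a hypothetical `topic_id_depth_1` column and made-up ID values:

```python
import pandas as pd

# Hypothetical frame; the topic_id_depth_1 column and its values are assumed.
df = pd.DataFrame(
    {
        "id": [18963, 18964, 18965],
        "topic_id_depth_1": [1, 1, 2],
    }
)

# Map each topic ID to the list of datum IDs assigned to it.
ids_by_topic = df.groupby("topic_id_depth_1")["id"].apply(list).to_dict()
print(ids_by_topic)
```

The same call works at any depth by swapping in that depth's ID column.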


2 participants