KnowledgeGraph building using KnowledgeGraphIndex is very slow #8433
Replies: 5 comments
-
The vector index calls the embedding model during index construction (very fast, since calls can be batched). The knowledge graph index calls the LLM by default (very slow, sequential). Hence the speed difference. Batching isn't always supported by LLMs, and even when it is, rate limits are also an issue. It would take a careful PR to fix this.
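Since the per-chunk extraction calls are independent, one workaround while waiting for such a PR is to issue them concurrently rather than strictly sequentially. A minimal sketch, where `call_llm_extract` is a hypothetical stand-in for the real per-chunk LLM call (keep `max_workers` small to stay under provider rate limits):

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm_extract(chunk: str):
    # Hypothetical stand-in for the real per-chunk LLM triplet-extraction call.
    return [(chunk[:20], "mentions", "something")]

def parallel_extract(chunks, max_workers=4):
    # Fan the independent per-chunk calls out over a small thread pool;
    # a low max_workers keeps request volume under provider rate limits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm_extract, chunks))
```

`ThreadPoolExecutor.map` preserves input order, so the triplet lists line up with their source chunks.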
-
🤖 Hi there,
Thank you for reaching out and providing detailed information about your issue. To improve the performance of knowledge graph building with KnowledgeGraphIndex, you might want to consider the following strategies:
Here is an example of how you might adjust these parameters when initializing the index:
kg_index = KnowledgeGraphIndex(
service_context=service_context,
storage_context=storage_context,
max_triplets_per_chunk=5,
include_embeddings=False,
max_object_length=100,
kg_triplet_extract_fn=my_custom_extract_fn,
)

Please note that while these changes can improve performance, they may also affect the quality of the results. You should therefore test these changes thoroughly to ensure they meet your needs.

In the context shared, it's also important to note that not all cases will benefit from the use of a knowledge graph. For example, if the question does not contain small-grained pieces of knowledge in chunks, the extra knowledge graph retriever may not be helpful. Therefore, it's crucial to evaluate the nature of your data and the type of queries you're dealing with before deciding on the best indexing strategy.

As for the specific performance differences between the KnowledgeGraphIndex and the vector index, I wasn't able to find an answer within the repository. Performance can vary greatly depending on the specific use case, data, and configuration.

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

Best,
Sources
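On the `kg_triplet_extract_fn` option: the hook accepts a callable that takes a text chunk and returns a list of (subject, relation, object) tuples, which bypasses the LLM call entirely. A toy rule-based sketch — the regex pattern is an assumption for illustration only; substitute your own NER or relation-extraction logic for real data:

```python
import re
from typing import List, Tuple

def my_custom_extract_fn(text: str) -> List[Tuple[str, str, str]]:
    # Toy heuristic: capture "X is a/an/the Y" patterns with a regex.
    # No LLM call is made, so extraction is effectively instantaneous.
    triplets = []
    for subj, obj in re.findall(r"\b([A-Z][a-z]+)\b is (?:a|an|the) (\w+)", text):
        triplets.append((subj, "is", obj))
    return triplets

print(my_custom_extract_fn("Alice is an engineer."))  # [('Alice', 'is', 'engineer')]
```

A heuristic like this trades recall for speed; measure graph quality before adopting it.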
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
-
🤖 Hello,
Thank you for reaching out with your concern about the performance of knowledge graph building using KnowledgeGraphIndex. To improve performance, you can consider adjusting some parameters in your KnowledgeGraphIndex configuration. Here are some suggestions:
Here is an example of how you might adjust these parameters when initializing the index:
index = KnowledgeGraphIndex(
service_context=service_context,
storage_context=storage_context,
max_triplets_per_chunk=5,
include_embeddings=False,
max_object_length=64,
)

Please note that while these changes can improve performance, they may also affect the quality of the results. You should therefore test these changes thoroughly to ensure they meet your needs.

In terms of the specific performance differences between the KnowledgeGraphIndex and the vector index, it's important to note that performance can be influenced by the complexity of the data and the relationships between entities. If the data contains many complex relationships, extracting them and building the knowledge graph can be slower. Therefore, another strategy to improve performance could be to simplify the data or the relationships between entities, if possible.

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

Best,
Sources
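On the "simplify the data" point: for an email corpus, stripping quoted replies and signature blocks before indexing cuts the number of chunks the LLM has to process. A sketch with assumed heuristics — tune the patterns to your own mail format:

```python
import re

def simplify_email(text: str) -> str:
    # Drop quoted reply lines (starting with ">") and everything after the
    # conventional "-- " signature delimiter; both rules are heuristics.
    kept = []
    for line in text.splitlines():
        if line.lstrip().startswith(">"):
            continue
        if re.match(r"^--\s*$", line):
            break
        kept.append(line)
    return "\n".join(kept).strip()

print(simplify_email("Update attached.\n> earlier thread\n-- \nJohn"))  # Update attached.
```

Run this over documents before handing them to the index so fewer, cleaner chunks reach the triplet extractor.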
-
I have the same issue here. I experience token timeouts because KnowledgeGraphIndex construction takes more than two hours. Can you recommend a file-size limit when using an LLM for triplet extraction? Thank you.
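The practical limit is less about file size than about chunk count, since the default extraction makes roughly one sequential LLM call per chunk. A back-of-envelope estimator — the per-call latency and words-per-chunk figures are assumptions; measure your own deployment:

```python
import math

def estimate_kg_build_hours(total_words: int, words_per_chunk: int = 400,
                            seconds_per_llm_call: float = 4.0) -> float:
    # One LLM call per chunk, issued sequentially by default.
    num_chunks = math.ceil(total_words / words_per_chunk)
    return num_chunks * seconds_per_llm_call / 3600

# e.g. 500 emails of ~500 words each:
print(round(estimate_kg_build_hours(500 * 500), 2))  # 0.69
```

Inverting the formula tells you how many words fit in your time budget before a timeout becomes likely.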
-
In fact, this process takes rather long. Is it possible to save checkpoints for KnowledgeGraphIndex? Thank you once again!
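As far as I know there is no built-in checkpoint for KnowledgeGraphIndex construction, but you can approximate one by inserting documents incrementally and recording progress yourself. A generic resume pattern — `process_doc` is a hypothetical callback where you would call `index.insert(...)` and persist the storage context:

```python
import json
import pathlib

def build_with_checkpoints(doc_ids, process_doc, ckpt_path="kg_ckpt.json"):
    # Record which documents are already indexed; on restart, skip them.
    ckpt = pathlib.Path(ckpt_path)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()
    for doc_id in doc_ids:
        if doc_id in done:
            continue
        process_doc(doc_id)  # e.g. insert the document and persist storage here
        done.add(doc_id)
        ckpt.write_text(json.dumps(sorted(done)))  # checkpoint after each doc
```

If the job dies mid-run, rerunning with the same checkpoint file resumes from the first unprocessed document instead of re-extracting everything.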
-
Question Validation
Question
Hi Team,
We have observed very slow performance while building the knowledge graph for email data (extracted in txt format) compared with building a vector index. We have tried to run this as K8s Jobs, but saw no improvement in performance.
The current throughput when building the KG with KnowledgeGraphIndex is 500 emails per hour.
Please note that building a vector index with LlamaIndex on the same set of email text data runs at about 2,500 emails per hour.
(The emails are already deduplicated and curated, and contain about 500 words on average.)
Below are the details of the configuration:
LLM - Azure OpenAI (gpt-35-turbo for the LLM and text-embedding-ada-002 for embeddings)
Graph DB - Nebula v3.6 installed on K8
llama-index==0.8.30
pymilvus==2.3.1
Please suggest if there is a way to improve the performance of KG building.