
[Question]: Anyone else encounter Ollama Starter Example working once, then failing thereafter? #13188

Open · 1 task done
130jd opened this issue May 1, 2024 · 2 comments
Labels: question (Further information is requested)

130jd commented May 1, 2024

Question Validation

  • I have searched both the documentation and Discord for an answer.

Question

I've previously built a RAG system with Ollama x langchain, but decided to try Ollama x llama-index to see if I'd get better results. I started with the Ollama Starter Example, got it to work once, and thereafter it would always time out (despite setting a 10+ minute timeout).

More details: In these failures, I noticed that it wouldn't even progress to the embedding phase (where the progress bars appear), and looking in Task Manager, ollama_llama_server.exe was consuming a ton of resources but nothing happened until the timeout. When it timed out, the error message would mention map_httpcore_exceptions (similar to the logs here).
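
For reference, my setup is essentially the Starter Example for local models with a longer request_timeout; the sketch below is roughly what I'm running (the model name, data folder, and timeout value are stand-ins for my actual values):

```python
# Rough sketch of my setup, based on the Starter Example for local models;
# the model name, data folder, and timeout are placeholders for my real values.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = Ollama(model="mistral", request_timeout=600.0)  # 10-minute timeout

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)

response = index.as_query_engine().query("What did the author do growing up?")
print(response)
```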

  1. Is it a problem with my machine specs? I have 16GB RAM and the Starter Tutorial says at least 32GB of RAM is required. If so, why did it work the first time? Also, others have run it on 16GB machines.
  2. Is it a problem specific to llama-index? I tried to use (through Ollama) the 4-bit quantized mistral and phi3 models, both of which have worked fine (by themselves and in the langchain RAG) on my machine. For example, llama-index has an in-memory vector store; is it possible that it worked the first time, then the in-memory vector store ran out of memory and thus failed on subsequent tries? Or is it a problem specific to the llama-index-llms-ollama library? Some say using the Ollama library directly works.
  3. Is it a problem with Ollama itself? It sounds like some default context window mismatches are causing odd behavior. See here and here.

Does any of the above give any clues that help diagnose what's going on? I'm not just looking for a fix; I'm trying to understand the principles of why llama-index may not be working for me. And is there some fundamental problem with Ollama x llama-index (that isn't a problem with Ollama x langchain), such that if I want to use llama-index I need to ditch Ollama for OpenAI?

Thanks in advance.

dosubot bot commented May 1, 2024

It sounds like you're encountering a complex issue with the Ollama x LlamaIndex integration, where resource consumption and timeouts are a significant concern. Here are some targeted insights and potential solutions to address the problems you're facing:

  1. Machine Specs and Resource Consumption: Although the Ollama Starter Example worked once on your 16GB RAM machine, the minimum requirement of 32GB RAM suggests that resource constraints might be causing the timeouts and excessive resource consumption you're observing. The initial success might have been due to lower resource demands for the first run or variations in the data being processed.

  2. Llama-Index Specific Issues: The in-memory vector store of Llama-Index could indeed be running out of memory on subsequent tries, especially if it's not releasing memory efficiently after each use. This aligns with your observation that it worked the first time but failed thereafter. The llama-index-llms-ollama library might have specific memory management or resource allocation patterns that differ from other setups you've used successfully.

  3. Ollama Configuration and Performance: Issues with default context window mismatches and the specific configurations for Ollama could be contributing to the odd behavior and failures. Adjusting the request_timeout parameter in the Ollama initialization to accommodate longer processing times might help, as well as ensuring that the Ollama server is adequately provisioned and optimized for performance.

  4. Troubleshooting Steps:

    • Incrementally increase the request_timeout parameter to see if it alleviates the timeout issues (see the sketch after this list).
    • Monitor system resources closely during operation to identify any bottlenecks.
    • Consider implementing robust error handling and retry mechanisms to manage transient failures more gracefully.
    • Review and optimize the server and network configurations to ensure they are not contributing to the issues.
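
As a rough, unverified sketch of the first bullet, the request_timeout can be raised incrementally and the query retried; the model name, starting timeout, growth factor, and cap below are placeholders:

```python
# Unverified sketch: retry a query with a progressively larger request_timeout.
# Model name, starting timeout, growth factor, and cap are all placeholders.
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama


def query_with_growing_timeout(index, question, start=120.0, factor=2.0, cap=1200.0):
    timeout = start
    while timeout <= cap:
        # Rebuild the LLM with a longer timeout on each attempt.
        Settings.llm = Ollama(model="mistral", request_timeout=timeout)
        try:
            return index.as_query_engine().query(question)
        except Exception as exc:  # timeouts surface through httpx/httpcore
            print(f"Timed out after {timeout:.0f}s ({exc!r}); retrying with a longer timeout.")
            timeout *= factor
    raise RuntimeError("Query kept timing out, even at the maximum timeout.")
```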

Additionally, a similar issue related to kernel crashing or CUDA OutOfMemory Error when running embedding models locally suggests that local models cannot run asynchronously due to being compute-bound. Lowering the batch size, for example, using HuggingFaceEmbedding(..., embed_batch_size=1), might reduce memory usage and alleviate some of the problems you're experiencing.
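
A minimal sketch of that adjustment (the embedding model name here is only an illustration):

```python
# Sketch: shrink the embedding batch size to reduce peak memory during indexing.
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # example model; swap in whatever you use
    embed_batch_size=1,  # embed one node at a time instead of the default batch
)
```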

Understanding these principles and adjusting your setup accordingly could help mitigate the issues with the Ollama x LlamaIndex integration. If you continue to face challenges, providing more specific error messages or configurations could help in diagnosing and addressing the problem more effectively.


logan-markewich (Collaborator) commented

I've never had or really seen this issue 🤔 You could decrease the context window on the LLM to try and reduce memory usage, if that feels like the issue.

LlamaIndex always tries to use the full context window in order to reduce the number of LLM calls.
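
Something roughly like this (untested sketch, the numbers are arbitrary; double-check that your version of llama-index-llms-ollama forwards num_ctx the way you expect):

```python
# Untested sketch: cap the context window so llama-index budgets fewer tokens per call,
# and pass num_ctx so the Ollama server allocates a smaller KV cache.
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(
    model="mistral",                      # placeholder model name
    request_timeout=600.0,
    context_window=2048,                  # what llama-index assumes when packing prompts
    additional_kwargs={"num_ctx": 2048},  # what the Ollama server actually loads
)
```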
