llm-evaluation

Here are 55 public repositories matching this topic...

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd cicd prompts evaluation-framework rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Jun 4, 2024
TypeScript

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for LLMs and ML models

Updated Jun 4, 2024
Python

langfuse / langfuse

🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai gpt observability large-language-models llm prompt-engineering langchain llmops llama-index prompt-management evals llm-evaluation

Updated Jun 4, 2024
TypeScript

Agenta-AI / agenta

The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.

prompt-toolkit rag human-annotation large-language-models llm prompt-engineering llms langchain llmops llama-index prompt-management llm-tools llm-framework llm-evaluation rag-evaluation

Updated Jun 4, 2024
Python

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Jun 3, 2024
Python

innerNULL / summary-evaluator

Summary Evaluation Tool

nlp deep-learning text-summarization model-evaluation model-evaluation-metrics llm bertscore llm-evaluation

Updated Jun 3, 2024
Python

Psycoy / MixEval

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation benchmarking-suite evaluation-framework benchmarking-framework foundation-models large-language-models large-language-model llm-inference llm-evaluation large-multimodal-models llm-evaluation-framework benchmark-mixture dynamic-benchmark

Updated Jun 3, 2024
Python

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Jun 3, 2024
Python

PetroIvaniuk / llms-tools

A list of LLM Tools & Projects

data-science machine-learning ai chatbots chat-bot llm chatgpt open-source-llm llm-evaluation

Updated Jun 2, 2024

relari-ai / continuous-eval

Open-Source Evaluation for GenAI Application Pipelines

information-retrieval evaluation-metrics evaluation-framework rag llmops retrieval-augmented-generation llm-evaluation

Updated Jun 2, 2024
Python

nagababumo / Building-and-Evaluating-Advanced-RAG

python rag llamaindex retrieval-augmented-generation llm-evaluation llm-evaluation-framework

Updated Jun 1, 2024
Jupyter Notebook

parea-ai / parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

llm prompt-engineering llms llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated May 31, 2024
TypeScript

loganrjmurphy / LeanEuclid

LeanEuclid is a benchmark for autoformalization in the domain of Euclidean geometry, targeting the proof assistant Lean.

theorem-proving formalization euclidean-geometry lean4 llm-evaluation autoformalization

Updated May 31, 2024
Lean

deshwalmahesh / PHUDGE

Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without a custom rubric or reference answer, on an absolute or relative basis, and more. It also contains a list of available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more.