🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
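As an illustration of the OpenAI SDK integration named above, here is a minimal sketch assuming the Langfuse Python SDK's drop-in `langfuse.openai` wrapper; the model name and prompt are placeholders, so treat it as illustrative rather than canonical.

```python
# Minimal sketch: Langfuse drop-in wrapper around the OpenAI SDK
# (assumes the Python SDK's langfuse.openai module; model and prompt are placeholders).
from langfuse.openai import openai  # drop-in replacement for the openai module

completion = openai.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize what LLM observability means."}],
)
print(completion.choices[0].message.content)
# The call is traced automatically and appears in the Langfuse UI.
```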
[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou
Moodle plugin for running evaluations; this is the evaluation activity plugin.
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 50+ HF models, and 20+ benchmarks.
Toolkit for evaluating and monitoring AI models in clinical settings
LangSmith Client SDK Implementations
The production toolkit for LLMs. Observability, prompt management and evaluations.
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the code-generation quality of LLMs.
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
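For example, 🤗 Evaluate exposes metrics through `evaluate.load`; the snippet below is a minimal sketch using the built-in accuracy metric with toy labels.

```python
# Minimal sketch: loading and computing a metric with the 🤗 Evaluate library.
import evaluate

accuracy = evaluate.load("accuracy")  # load the built-in accuracy metric
results = accuracy.compute(
    references=[0, 1, 1, 0],   # toy ground-truth labels
    predictions=[0, 1, 0, 0],  # toy model predictions
)
print(results)  # {'accuracy': 0.75}
```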
Deep R Programming (Open-Access Textbook)
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
🤖 Build AI applications with confidence ✅ DSPy Visualizer ✅ Understand how your users are using your LLM-app ✅ Get a full picture of the quality performance of your LLM-app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM-app.
LangEvals aggregates various language model evaluators into a single platform, providing a standard interface to a multitude of scores and LLM guardrails so you can protect and benchmark your LLMs and pipelines.
R Package for preprocessing, normalizing, and analyzing proteomics data
The RAG Experiment Accelerator is a versatile tool designed to streamline experiments and evaluations that use Azure Cognitive Search and the RAG pattern.
FuzzBench - Fuzzer benchmarking as a service.
An open science framework for incentivising contributions to the commons