Fix self extend on the server. #7239

Draft
wants to merge 2 commits into
base: master

Contributor

@Maximilian-Winter Maximilian-Winter commented May 12, 2024

Self-extend is broken on the server according to #7005.
This PR tries to fix the self-extend mechanism in the server. I tested it with the passkey test, and the model predicted the passkey correctly. I replicated llama.cpp's passkey test by hand because I wasn't sure how to interpret the results of the behave run. I basically copied the prompt shown by the behave passkey test and added the token "[INST]" at the beginning and "[/INST]" at the end. Then I ran it against the completion endpoint.

I would be happy if someone could test it and give it a try.

Edit:
I did another test with mistral instruct v0.2, using 50,000 tokens of context text with the passkey placed once in the middle. It worked really well. I then ran the same test without self-extend enabled, and the model said that the passkey isn't in the text.
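
The manual passkey check described above could be sketched roughly as follows. This is a minimal illustration, not the behave test itself: the filler sentences, the passkey wording, the helper names (`build_passkey_prompt`, `build_request`, `query_server`) and the server address are assumptions made for the example. Only the `[INST]`/`[/INST]` wrapping and the use of the server's `/completion` endpoint come from the description.

```python
import json
import urllib.request

# Illustrative filler text (an assumption, not copied from the behave test).
FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. "


def build_passkey_prompt(passkey: int, n_filler: int, insert_at: int) -> str:
    """Hide one passkey sentence inside repeated filler and wrap it in [INST] tags."""
    if not 0 <= insert_at < n_filler:
        raise ValueError("insert_at must be in the range [0, n_filler)")
    parts = ["There is important info hidden inside a lot of irrelevant text. Find it and memorize it. "]
    for i in range(n_filler):
        if i == insert_at:
            # The passkey appears exactly once, at the chosen filler index.
            parts.append(f"The pass key is {passkey}. Remember it. {passkey} is the pass key. ")
        parts.append(FILLER)
    parts.append("What is the pass key?")
    return "[INST]" + "".join(parts) + "[/INST]"


def build_request(prompt: str, n_predict: int = 16) -> dict:
    """JSON body for the server's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict, "temperature": 0.0}


def query_server(base_url: str, payload: dict) -> str:
    """POST the payload to <base_url>/completion and return the generated text."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["content"]
```

With a server running locally (assumed here to listen on `http://localhost:8080`), calling `query_server("http://localhost:8080", build_request(build_passkey_prompt(42871, 2000, 1000)))` would return the model's answer, which can then be checked for the passkey. The filler count needed to reach a given token length depends on the tokenizer.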

@mofosyne added the examples, review complexity : low, server, and help wanted labels on May 12, 2024
Contributor

github-actions bot commented May 12, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 551 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8491.26ms p(95)=20477.4ms fails=, finish reason: stop=483 truncated=68
  • Prompt processing (pp): avg=101.21tk/s p(95)=430.92tk/s
  • Token generation (tg): avg=34.55tk/s p(95)=48.95tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fixed_self_extension commit=f4f5b7ac560de66be4e875210f8c3679ef4b3dac

prompt_tokens_seconds

[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations. y-axis: llamacpp:prompt_tokens_seconds; x-axis: time, 1715525184 --> 1715525812. The series starts at 0.0, peaks at 897.0, and ends at 873.02.]

predicted_tokens_seconds

[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations. y-axis: llamacpp:predicted_tokens_seconds; x-axis: time, 1715525184 --> 1715525812. The series starts at 0.0, peaks at 44.46, and ends at 30.15.]


kv_cache_usage_ratio

[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations. y-axis: llamacpp:kv_cache_usage_ratio; x-axis: time, 1715525184 --> 1715525812. The series ranges from 0.0 to a peak of 0.59 and ends at 0.18.]

requests_processing

[Chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations. y-axis: llamacpp:requests_processing; x-axis: time, 1715525184 --> 1715525812. The series ranges from 0.0 to 8.0 and ends at 7.0.]


@Maximilian-Winter
Contributor Author

I did additional tests and realized that the code that was previously removed wasn't being called anyway. But in my tests it works as it should and finds the passkey. For example, hermes pro llama 8b with 8k context can retrieve the passkey with self-extend from a 50k-token text, but produces garbage without it.
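
As a rough illustration of why an 8k-context model can still address a 50k-token text, here is a sketch of the grouped-position idea behind Self-Extend as I understand it from the method's description. It is not the server's implementation. The function names and the example values `group_size=8` and `neighbor_window=1024` are assumptions chosen for the example, not the settings used in the tests above.

```python
def self_extend_rel_pos(q_pos: int, k_pos: int, group_size: int, neighbor_window: int) -> int:
    """Relative position between a query and an earlier key under grouped attention.

    Keys closer than neighbor_window keep their exact distance; keys further away
    have their positions floor-divided by group_size, shifted so the two regimes
    join at the window boundary.
    """
    if k_pos > q_pos:
        raise ValueError("the key must not come after the query (causal attention)")
    distance = q_pos - k_pos
    if distance < neighbor_window:
        return distance
    shift = neighbor_window - neighbor_window // group_size
    return (q_pos // group_size + shift) - (k_pos // group_size)


def max_rel_pos(seq_len: int, group_size: int, neighbor_window: int) -> int:
    """Largest relative position seen by the last token of a seq_len-token text."""
    q_pos = seq_len - 1
    return max(
        self_extend_rel_pos(q_pos, k_pos, group_size, neighbor_window)
        for k_pos in range(seq_len)
    )
```

Under these assumed values, `max_rel_pos(50000, 8, 1024)` stays below 8192, whereas without grouping the last token would see a relative position of 49999, far outside an 8k training window. That is consistent with the model producing garbage when self-extend is disabled.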

@ggerganov
Owner

It's probably better to go back and see which change makes the server test fail.

@mofosyne mofosyne marked this pull request as draft May 15, 2024 01:51