
Command-R-Plus, Context Window Limitations #660

Open
jeanromainroy opened this issue Apr 8, 2024 · 42 comments

Comments

@jeanromainroy

jeanromainroy commented Apr 8, 2024

Cohere's new Command-R-Plus model reportedly features a 128k context window. However, testing with progressively longer prompts reveals it begins producing nonsensical output (e.g., "<PAD><PAD>...") after 8192 tokens, aligning with the "max_position_embeddings" value in the config.json file. The config also lists a "rope_theta" value, suggesting its role in achieving the large context window. Is "rope" supported in MLX?
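
For reference, a minimal sketch of the kind of token-length check behind these numbers (not the exact test script; the repeated lorem-ipsum string is just a placeholder for real text):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")

def token_count(text: str) -> int:
    # Number of tokens the prompt occupies in the context window
    return len(tokenizer.encode(text))

# Placeholder text; the reported breakdown happens once the prompt crosses ~8192 tokens,
# matching "max_position_embeddings" in config.json.
long_text = "Lorem ipsum dolor sit amet. " * 2000
print(token_count(long_text))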

@awni
Member

awni commented Apr 8, 2024

However, testing with progressively longer prompts reveals it begins producing nonsensical output (e.g., "<PAD><PAD>...") after 8192 tokens, aligning with the "max_position_embeddings" value in the config.json file

🤔 not sure what would cause that. Do you have a prompt that should work in the MLX version that doesn't? Also if you are able to provide some expected output that would also be helpful.

The config also lists a "rope_theta" value, suggesting its role in achieving the large context window. Is "rope" supported in MLX?

MLX has RoPE and it should be used correctly already.
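
For context, roughly how RoPE gets wired up on the MLX side (a sketch, not the exact mlx_lm model code; the traditional=False flag is an assumption), with the base taken from the config's rope_theta:

import json
import mlx.core as mx
import mlx.nn as nn

# config.json from the converted model directory
with open("config.json") as f:
    config = json.load(f)

head_dim = config["hidden_size"] // config["num_attention_heads"]
rope = nn.RoPE(head_dim, traditional=False, base=config["rope_theta"])

# (batch, heads, seq_len, head_dim); offset would be the KV-cache length during decoding
x = mx.random.normal((1, config["num_attention_heads"], 16, head_dim))
y = rope(x, offset=0)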

@fblissjr

fblissjr commented Apr 8, 2024

I'm getting random Cyrillic in my responses when using tokenizer.apply_tool_use_template. Anyone else? Seems to only be when using that tool template from the tokenizer.

Example output:

Write 'Action:' followed by a json-formatted list of actions that you want to perform in order to produce a good response to the user's last input. You can use any of the supplied tools any number of times, but you should aim to execute the minimum number of necessary actions for the input. You should use the directly-answer tool if calling the other tools is unnecessary. The list of actions you want to call should be formatted as a list of json objects, for example:

[
    {
        "tool_name": title of the tool in the specification,
        "parameters": a dict of parameters to input into the tool as they are defined in the specs, or {} if it takes no parameters
    }
]```<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
Action: ```json
[
    {
        "tool некоторыми": {},
        "tool_name": "internet_search"
    } forniscono]
```<EOS_TOKEN>

@fblissjr

fblissjr commented Apr 8, 2024

Ignore this, I was calling the tokenizer twice. Fixed it in my code here for anyone who wants to test tool use (apologies in advance if there are bugs still lurking): https://github.com/fblissjr/mlx-funbox

@fblissjr

fblissjr commented Apr 8, 2024

Looks like there's still random switching to other languages and random Cyrillic (using a simple generate + apply tool template). Has anyone tested on CUDA to see if the behavior is similar?

@fblissjr

fblissjr commented Apr 8, 2024

Looks like it's the tokenizer.json that is not converting correctly. See the tokenizer.json from the Cohere HF model repo:
[screenshot: tokenizer.json from the Cohere HF model repo]

Compared to a fresh mlx_lm.convert -q (no other params) that I just ran against that same repo 20 minutes ago, which also matches the tokenizer.json from the mlx-community quant uploaded earlier (mlx-community/c4ai-command-r-plus-4bit):
[screenshot: tokenizer.json produced by mlx_lm.convert]

@fblissjr

fblissjr commented Apr 8, 2024

Copying the original cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) fixes this issue completely from my testing (output generation is slow, but so far so good!)

My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.

edit: generation speed is also slightly faster now due to the correct tokenizer being used.
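
For anyone who wants to try the same workaround, a rough sketch (the destination path is a placeholder) of swapping in the original tokenizer.json:

import shutil
from huggingface_hub import hf_hub_download

# Fetch the original tokenizer.json from the Cohere repo
original = hf_hub_download("CohereForAI/c4ai-command-r-plus", "tokenizer.json")

# Overwrite the one written by mlx_lm.convert (placeholder path)
shutil.copy(original, "/path/to/converted/model/tokenizer.json")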

@awni
Member

awni commented Apr 8, 2024

That is very odd. The tokenizer copying is very simple in MLX LM. We basically load with Hugging Face and then save it with Hugging Face. There is no MLX code involved. https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L619

I wonder if we are somehow using the API incorrectly or maybe there is a bug in the way it's saved with Transformers.

@awni
Member

awni commented Apr 8, 2024

@fblissjr you can reproduce the behavior with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tokenizer.save_pretrained(".")

I feel that should not break the tokenizer... so it might be worth filing an issue with the Cohere HF repo or the Transformers repo? Wdyt?

@fblissjr

fblissjr commented Apr 8, 2024

@awni my guess is the latter. It looks more like it's saved incorrectly (and oddly, just going by the look of it) in the HF repo. I haven't seen a tokenizer.json like this before. Here's a quick sample, roughly one page, of the tokenizer.json from https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json

{"version": "1.0", "truncation": null, "padding": null, "added_tokens": [{"id": 0, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 1, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 2, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 3, "content": "", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 4, "content": "<MASK_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 5, "content": "<BOS_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 6, "content": "<EOS_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 7, "content": "<EOP_TOKEN>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false, "special": true}, {"id": 255000, "special": false, "content": "<|START_OF_TURN_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255001, "special": false, "content": "<|END_OF_TURN_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255002, "special": false, "content": "<|YES_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255003, "special": false, "content": "<|NO_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255004, "special": false, "content": "<|GOOD_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255005, "special": false, "content": "<|BAD_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255006, "special": false, "content": "<|USER_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255007, "special": false, "content": "<|CHATBOT_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255008, "special": false, "content": "<|SYSTEM_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255009, "special": false, "content": "<|USER_0_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255010, "special": false, "content": "<|USER_1_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255011, "special": false, "content": "<|USER_2_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255012, "special": false, "content": "<|USER_3_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255013, "special": false, "content": "<|USER_4_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255014, "special": false, "content": "<|USER_5_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255015, "special": false, "content": "<|USER_6_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255016, "special": false, "content": "<|USER_7_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255017, "special": false, "content": "<|USER_8_TOKEN|>", 
"single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255018, "special": false, "content": "<|USER_9_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255019, "special": false, "content": "<|EXTRA_0_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255020, "special": false, "content": "<|EXTRA_1_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255021, "special": false, "content": "<|EXTRA_2_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255022, "special": false, "content": "<|EXTRA_3_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255023, "special": false, "content": "<|EXTRA_4_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255024, "special": false, "content": "<|EXTRA_5_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255025, "special": false, "content": "<|EXTRA_6_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255026, "special": false, "content": "<|EXTRA_7_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255027, "special": false, "content": "<|EXTRA_8_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}, {"id": 255028, "special": false, "content": "<|EXTRA_9_TOKEN|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}], "normalizer": {"type": "NFC"}, "pre_tokenizer": {"type": "Sequence", "pretokenizers": [{"type": "Digits", "individual_digits": true}, {"type": "ByteLevel", "add_prefix_space": false, "trim_offsets": true, "use_regex": true}]}, "post_processor": {"add_prefix_space": true, "trim_offsets": false, "use_regex": true, "type": "TemplateProcessing", "single": [{"SpecialToken": {"id": "<BOS_TOKEN>", "type_id": 0}}, {"Sequence": {"id": "A", "type_id": 0}}, {"SpecialToken": {"id": "<|END_OF_TURN_TOKEN|>", "type_id": 1}}, {"SpecialToken": {"id": "<EOS_TOKEN>", "type_id": 1}}], "pair": [{"SpecialToken": {"id": "<BOS_TOKEN>", "type_id": 0}}, {"Sequence": {"id": "A", "type_id": 0}}, {"Sequence": {"id": "B", "type_id": 1}}, {"SpecialToken": {"id": "<|END_OF_TURN_TOKEN|>", "type_id": 1}}, {"SpecialToken": {"id": "<EOS_TOKEN>", "type_id": 1}}], "special_tokens": {"<BOS_TOKEN>": {"id": "<BOS_TOKEN>", "ids": [5], "tokens": ["<BOS_TOKEN>"]}, "<EOS_TOKEN>": {"id": "<EOS_TOKEN>", "ids": [6], "tokens": ["<EOS_TOKEN>"]}, "<|END_OF_TURN_TOKEN|>": {"id": "<|END_OF_TURN_TOKEN|>", "ids": [255001], "tokens": ["<|END_OF_TURN_TOKEN|>"]}}}, "decoder": {"type": "ByteLevel", "add_prefix_space": true, "trim_offsets": true, "use_regex": true}, "model": {"type": "BPE", "dropout": null, "unk_token": null, "continuing_subword_prefix": null, "end_of_word_suffix": null, "fuse_unk": false, "byte_fallback": false, "vocab": {"": 0, "": 1, "": 2, "": 3, "<MASK_TOKEN>": 4, "<BOS_TOKEN>": 5, "<EOS_TOKEN>": 6, "<EOP_TOKEN>": 7, "!": 8, """: 9, "#": 10, "$": 11, "%": 12, "&": 13, "'": 14, "(": 15, ")": 16, "*": 17, "+": 18, ",": 19, "-": 20, ".": 21, "/": 22, "0": 23, "1": 24, "2": 25, "3": 26, "4": 27, "5": 28, "6": 29, "7": 30, "8": 31, "9": 32, ":": 33, ";": 34, "<": 35, "=": 36, ">": 37, "?": 38, "@": 39, "A": 40, "B": 41, "C": 42, "D": 43, "E": 44, "F": 45, "G": 46, "H": 47, "I": 48, "J": 49, "K": 50, "L": 51, "M": 
52, "N": 53, "O": 54, "P": 55, "Q": 56, "R": 57, "S": 58, "T": 59, "U": 60, "V": 61, "W": 62, "X": 63, "Y": 64, "Z": 65, "[": 66, "\": 67, "]": 68, "^": 69, "_": 70, "`": 71, "a": 72, "b": 73, "c": 74, "d": 75, "e": 76, "f": 77, "g": 78, "h": 79, "i": 80, "j": 81, "k": 82, "l": 83, "m": 84, "n": 85, "o": 86, "p": 87, "q": 88, "r": 89, "s": 90, "t": 91, "u": 92, "v": 93, "w": 94, "x": 95, "y": 96, "z": 97, "{": 98, "|": 99, "}": 100, "~": 101, "\u00a1": 102, "\u00a2": 103, "\u00a3": 104, "\u00a4": 105, "\u00a5": 106, "\u00a6": 107, "\u00a7": 108, "\u00a8": 109, "\u00a9": 110, "\u00aa": 111, "\u00ab": 112, "\u00ac": 113, "\u00ae": 114, "\u00af": 115, "\u00b0": 116, "\u00b1": 117, "\u00b2": 118, "\u00b3": 119, "\u00b4": 120, "\u00b5": 121, "\u00b6": 122, "\u00b7": 123, "\u00b8": 124, "\u00b9": 125, "\u00ba": 126, "\u00bb": 127, "\u00bc": 128, "\u00bd": 129, "\u00be": 130, "\u00bf": 131, "\u00c0": 132, "\u00c1": 133, "\u00c2": 134, "\u00c3": 135, "\u00c4": 136, "\u00c5": 137, "\u00c6": 138, "\u00c7": 139, "\u00c8": 140, "\u00c9": 141, "\u00ca": 142, "\u00cb": 143, "\u00cc": 144, "\u00cd": 145, "\u00ce": 146, "\u00cf": 147, "\u00d0": 148, "\u00d1": 149, "\u00d2": 150, "\u00d3": 151, "\u00d4": 152, "\u00d5": 153, "\u00d6": 154, "\u00d7": 155, "\u00d8": 156, "\u00d9": 157, "\u00da": 158, "\u00db": 159, "\u00dc": 160, "\u00dd": 161, "\u00de": 162, "\u00df": 163, "\u00e0": 164, "\u00e1": 165, "\u00e2": 166, "\u00e3": 167, "\u00e4": 168, "\u00e5": 169, "\u00e6": 170, "\u00e7": 171, "\u00e8": 172, "\u00e9": 173, "\u00ea": 174, "\u00eb": 175, "\u00ec": 176, "\u00ed": 177, "\u00ee": 178, "\u00ef": 179, "\u00f0": 180, "\u00f1": 181, "\u00f2": 182, "\u00f3": 183, "\u00f4": 184, "\u00f5": 185, "\u00f6": 186, "\u00f7": 187, "\u00f8": 188, "\u00f9": 189, "\u00fa": 190, "\u00fb": 191, "\u00fc": 192, "\u00fd": 193, "\u00fe": 194, "\u00ff": 195, "\u0100": 196, "\u0101": 197, "\u0102": 198, "\u0103": 199, "\u0104": 200, "\u0105": 201, "\u0106": 202, "\u0107": 203, "\u0108": 204, "\u0109": 205, "\u010a": 206, "\u010b": 207, "\u010c": 208, "\u010d": 209, "\u010e": 210, "\u010f": 211, "\u0110": 212, "\u0111": 213, "\u0112": 214, "\u0113": 215, "\u0114": 216, "\u0115": 217, "\u0116": 218, "\u0117": 219, "\u0118": 220, "\u0119": 221, "\u011a": 222, "\u011b": 223, "\u011c": 224, "\u011d": 225, "\u011e": 226, "\u011f": 227, "\u0120": 228, "\u0121": 229, "\u0122": 230, "\u0123": 231, "\u0124": 232, "\u0125": 233, "\u0126": 234, "\u0127": 235, "\u0128": 236, "\u0129": 237, "\u012a": 238, "\u012b": 239, "\u012c": 240, "\u012d": 241, "\u012e": 242, "\u012f": 243, "\u0130": 244, "\u0131": 245, "\u0132": 246, "\u0133": 247, "\u0134": 248, "\u0135": 249, "\u0136": 250, "\u0137": 251, "\u0138": 252, "\u0139": 253, "\u013a": 254, "\u013b": 255, "\u013c": 256, "\u013d": 257, "\u013e": 258, "\u013f": 259, "\u0140": 260, "\u0141": 261, "\u0142": 262, "\u0143": 263, "\u200d": 264, "\u203c": 265, "\u2049": 266, "\u20e3": 267, "\u2122": 268, "\u2139": 269, "\u2194": 270, "\u2195": 271, "\u2196": 272, "\u2197": 273, "\u2198": 274, "\u2199": 275, "\u21a9": 276, "\u21aa": 277, "\u231a": 278, "\u231b": 279, "\u2328": 280, "\u23cf": 281, "\u23e9": 282, "\u23ea": 283, "\u23eb": 284, "\u23ec": 285, "\u23ed": 286, "\u23ee": 287, "\u23ef": 288, "\u23f0": 289, "\u23f1": 290, "\u23f2": 291, "\u23f3": 292, "\u23f8": 293, "\u23f9": 294, "\u23fa": 295, "\u24c2": 296, "\u25aa": 297, "\u25ab": 298, "\u25b6": 299, "\u25c0": 300, "\u25fb": 301, "\u25fc": 302, "\u25fd": 303, "\u25fe": 304, "\u2600": 305, "\u2601": 306, "\u2602": 307, "\u2603": 308, 
"\u2604": 309, "\u260e": 310, "\u2611": 311, "\u2614": 312, "\u2615": 313, "\u2618": 314, "\u261d": 315, "\u2620": 316, "\u2622": 317, "\u2623": 318, "\u2626": 319, "\u262a": 320, "\u262e": 321, "\u262f": 322, "\u2638": 323, "\u2639": 324, "\u263a": 325, "\u2640": 326, "\u2642": 327, "\u2648": 328, "\u2649": 329, "\u264a": 330, "\u264b": 331, "\u264c": 332, "\u264d": 333, "\u264e": 334, "\u264f": 335, "\u2650": 336, "\u2651": 337, "\u2652": 338, "\u2653": 339, "\u265f": 340, "\u2660": 341, "\u2663": 342, "\u2665": 343, "\u2666": 344, "\u2668": 345, "\u267b": 346, "\u267e": 347, "\u267f": 348, "\u2692": 349, "\u2693": 350, "\u2694": 351, "\u2695": 352, "\u2696": 353, "\u2697": 354, "\u2699": 355, "\u269b": 356, "\u269c": 357, "\u26a0": 358, "\u26a1": 359, "\u26a7": 360, "\u26aa": 361, "\u26ab": 362, "\u26b0": 363, "\u26b1": 364, "\u26bd": 365, "\u26be": 366, "\u26c4": 367, "\u26c5": 368, "\u26c8": 369, "\u26ce": 370, "\u26cf": 371, "\u26d1": 372, "\u26d3": 373, "\u26d4": 374, "\u26e9": 375, "\u26ea": 376, "\u26f0": 377, "\u26f1": 378, "\u26f2": 379, "\u26f3": 380, "\u26f4": 381, "\u26f5": 382, "\u26f7": 383, "\u26f8": 384, "\u26f9": 385, "\u26fa": 386, "\u26fd": 387, "\u2702": 388, "\u2705": 389, "\u2708": 390, "\u2709": 391, "\u270a": 392, "\u270b": 393, "\u270c": 394, "\u270d": 395, "\u270f": 396, "\u2712": 397, "\u2714": 398, "\u2716": 399, "\u271d": 400, "\u2721": 401, "\u2728": 402, "\u2733": 403, "\u2734": 404, "\u2744": 405, "\u2747": 406, "\u274c": 407, "\u274e": 408, "\u2753": 409, "\u2754": 410, "\u2755": 411, "\u2757": 412, "\u2763": 413, "\u276

@fblissjr

fblissjr commented Apr 8, 2024

@fblissjr you can reproduce the behavior with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tokenizer.save_pretrained(".")

I feel that should not break the tokenizer... so it might be worth filing an issue with the Cohere HF repo or the Transformers repo? Wdyt?

Agreed. I made a community post on HF here: https://huggingface.co/CohereForAI/c4ai-command-r-plus/discussions/15

and here: huggingface/transformers#30027

@fblissjr

fblissjr commented Apr 8, 2024

So this is interesting: the tokenizer.json in the bitsandbytes 4-bit repo linked from the main Cohere repo is a different size and looks nothing like the original. https://huggingface.co/CohereForAI/c4ai-command-r-plus-4bit/blob/main/tokenizer.json

@fblissjr

fblissjr commented Apr 8, 2024

Another interesting difference between the 4-bit bnb tokenizer and the original: in the original, token id 255001 (<|END_OF_TURN_TOKEN|>) has special set to False. In the 4-bit bnb one, it's True.

@fblissjr

fblissjr commented Apr 8, 2024

Per comments on the Hugging Face repo, the differences between the two tokenizer.json files are just unicode escaping differences. I'll assume I've got a bug on my end unless anyone else sees the same.
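
One way to sanity-check that (a quick sketch; the filenames are placeholders) is to parse both files and compare them structurally, which ignores pure \uXXXX escaping differences:

import json

with open("tokenizer_original.json") as f:   # tokenizer.json from CohereForAI/c4ai-command-r-plus
    original = json.load(f)
with open("tokenizer_converted.json") as f:  # tokenizer.json written by mlx_lm.convert
    converted = json.load(f)

# json.load decodes \uXXXX escapes, so if the only differences are unicode escaping,
# the parsed objects should compare equal.
print("structurally identical:", original == converted)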

@jeanromainroy
Author

# Libraries
from transformers import AutoTokenizer
import mlx.core as mx
import mlx_lm
from mlx_lm.utils import load_model, get_model_path


# Language Model
PATH_MODEL = "/Users/admin/Models/CohereForAI/c4ai-command-r-plus-4bit/"


# Load the model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_MODEL)
model = load_model(get_model_path(PATH_MODEL))


# Incrementally longer texts
...
text_7500_tokens = "Lorem ipsum dolor sit..."    # Works
text_8500_tokens = "Lorem ipsum dolor sit..."    # Stops working
...


# Format as list of messages
messages = [
    {"role": "user", "content": f"{text_8500_tokens}\n\nSummarize the text above in one short paragraph."}    # <-- set a text
]


# Apply chat template
prompt_decorated = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)


# Generate
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt_decorated,
    temp=0.0,
    max_tokens=64
)

This is what I have been using – I removed the texts, which are just some random Wikipedia page. The output is good until I try the 8500-token text, which just outputs <PAD><PAD><PAD><PAD><PAD>...

@fblissjr

fblissjr commented Apr 9, 2024

# Libraries
from transformers import AutoTokenizer
import mlx.core as mx
import mlx_lm
from mlx_lm.utils import load_model, get_model_path


# Language Model
PATH_MODEL = "/Users/admin/Models/CohereForAI/c4ai-command-r-plus-4bit/"


# Load the model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(PATH_MODEL)
model = load_model(get_model_path(PATH_MODEL))


# Incrementally longer texts
...
text_7500_tokens = "Lorem ipsum dolor sit..."    # Works
text_8500_tokens = "Lorem ipsum dolor sit..."    # Stops working
...


# Format as list of messages
messages = [
    {"role": "user", "content": f"{text_8500_tokens}\n\nSummarize the text above in one short paragraph."}    # <-- set a text
]


# Apply chat template
prompt_decorated = tokenizer.apply_chat_template(
    messages, 
    tokenize=False, 
    add_generation_prompt=True
)


# Generate
response = mlx_lm.generate(
    model,
    tokenizer,
    prompt_decorated,
    temp=0.0,
    max_tokens=64
)

This is what I have been using – I removed the texts, which are just some random Wikipedia page. The output is good until I try the 8500-token text, which just outputs <PAD><PAD><PAD><PAD><PAD>...

Have you tried apply_tool_use_template by chance? Curious if you see any of the oddities I see when using it.

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

Hey guys @awni, @fblissjr and @jeanromainroy,

The cohere team limited the context to 8k for all Command-R variants on purpose. If you check the config file for both r-v01 and r+, the max_position_embeddings is set to 8192.

It's a limit to avoid users experiencing OOM.

You can read more here:
https://huggingface.co/CohereForAI/c4ai-command-r-v01/discussions/12

@jeanromainroy
Author

Hey @Blaizzy, I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

Copying the original cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) fixes this issue completely from my testing (output generation is slow, but so far so good!)

My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.

edit: generation speed is also slightly faster now due to the correct tokenizer being used.

@fblissjr Indeed, the tokenizer.json created by the conversion is about 2 MB smaller than the original.

I updated as you suggested. Can you check it?

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

Hey @Blaizzy, I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.

@jeanromainroy can you try again with the change in this branch? If it works I will make a PR.

pip install -U git+https://github.com/Blaizzy/mlx-examples.git@pc/commandR#subdirectory=llms --use-pep517 

Link: https://github.com/Blaizzy/mlx-examples/tree/pc/commandR

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

You can also try to increase the default max_position_embeddings and let me know if it works.
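
In case it helps, here is roughly what that would look like (a sketch with a placeholder path): bump max_position_embeddings in the converted model's config.json before loading. Whether this alone is enough is exactly what's being debugged here.

import json
from pathlib import Path

config_path = Path("/path/to/converted/model/config.json")  # placeholder path
config = json.loads(config_path.read_text())

# Raise the positional limit to the advertised 128k (131072) window
config["max_position_embeddings"] = 131072
config_path.write_text(json.dumps(config, indent=2))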

@fblissjr

fblissjr commented Apr 9, 2024

Copying the original cohere tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json) fixes this issue completely from my testing (output generation is slow, but so far so good!)
My guess is something is happening in the mlx_lm.convert process due to the large size of the vocab + the multilingual nature of the tokenizer + the strange tokenizer.json formatting.
edit: generation speed is also slightly faster now due to the correct tokenizer being used.

@fblissjr Indeed, the tokenizer.json created by the conversion is about 2 MB smaller than the original.

I updated as you suggested. Can you check it?

I actually did this myself yesterday with my own quant, and the output was better and faster - no idea why. Now I'm unsure whether I just had a bug somewhere on my end or whether it actually made a difference.

I'm planning to test out a larger CUDA machine later today or tomorrow to see how it works natively.

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

Let me know how it goes, but for now according to your report the issue should be fixed.

@jeanromainroy
Author

Hey @Blaizzy, I tried your fork and the model is still outputting <PAD><PAD><PAD>... when I provide a long prompt.

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

Hey @Blaizzy, I tried your fork and the model is still outputting <PAD><PAD><PAD>... when I provide a long prompt.

I have made a new change, can you try it again please? :)

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

Wait, I think I got it!

Give me 30 min :)

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

@jeanromainroy can you try this branch? The previous one had a git issue:

https://github.com/Blaizzy/mlx-examples/tree/pc/command-R

@jeanromainroy
Author

Still outputting <PAD><PAD><PAD>... :(

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

Only PAD? Can you share the whole output?

@jeanromainroy
Author

It's outputting <PAD> for as long as I let it. In other words, max_tokens=256 results in 256 x <PAD>.

@Blaizzy
Contributor

Blaizzy commented Apr 9, 2024

Got it!

@awni the Cohere team set model_max_length to 128K on both Command-R models.

Is there a way of using this number with nn.RoPE? Are there any deeper changes needed? If so, please point them out and I can work on it.

@awni
Member

awni commented Apr 9, 2024

I'm not sure I follow your question. The nn.RoPE layer does not make any assumptions about the maximum sequence length.

@jeanromainroy
Author

He's talking about the Command-R+'s config:

[screenshot of the Command-R+ config showing model_max_length]

@awni
Member

awni commented Apr 10, 2024

@jeanromainroy regarding:

I have run the exact same test with the new llama.cpp implementation of Command-R+ and it works way above 8k tokens.

My understanding is llama.cpp uses a fixed-size context (the -c flag), so while it works for long context it does not really "work" in the sense that it does not use any context before a certain limit. It simply truncates everything before the most recent -c tokens. The default value for -c is 512:

  -c N, --ctx-size N    size of the prompt context (default: 512, 0 = loaded from model)

Maybe we could provide something similar... but I think the default behavior is a little misleading.
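
If we wanted something similar, a rough sketch (not an existing mlx_lm API) would be to keep only the last N tokens of the prompt before generating. Note that in practice you would truncate the document text rather than the fully templated prompt, so the chat template isn't cut in half:

def truncate_to_context(tokenizer, prompt: str, ctx_size: int = 8192) -> str:
    # Roughly mimic llama.cpp's fixed-size context (-c): drop everything
    # before the most recent ctx_size tokens.
    ids = tokenizer.encode(prompt)
    if len(ids) <= ctx_size:
        return prompt
    return tokenizer.decode(ids[-ctx_size:])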

@awni
Member

awni commented Apr 10, 2024

@fblissjr could you share a command using mlx_lm.generate that produces strange responses? I would like to debug that issue if you are still experiencing it.

@fblissjr

@fblissjr could you share a command using mlx_lm.generate that produces strange responses? I would like to debug that issue if you are still experiencing it.

I can't with mlx_lm.generate because it only happens when I run it with apply_tool_use_template from the tokenizer. I'm not at home right now and haven't tested since the day it happened. I think you can mock up something like this:

tools = [...]  # a JSON-style list of tool specs, similar to the Cohere example on HF
if tools:
    prompt = tokenizer.apply_tool_use_template(conversation, tools=tools, tokenize=False, add_generation_prompt=True)

Basically you want the apply_tool_use_template output to show up, which is a big, page-long prompt that looks like this (copied and pasted from the HF repo under the tool use output example):

<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
The instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.

# System Preamble

## Basic Rules

You are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.

# User Preamble

## Task and Context

You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

## Style Guide

Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.

## Available Tools

Here is a list of tools that you have available to you:
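
For completeness, here's roughly the end-to-end call I mean (a sketch; the tool spec follows the Cohere model card, but treat the exact keys and the method signature as assumptions):

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/c4ai-command-r-plus-4bit")

conversation = [{"role": "user", "content": "What's the biggest penguin in the world?"}]
tools = [
    {
        "name": "internet_search",
        "description": "Returns a list of relevant document snippets for a web query",
        "parameter_definitions": {
            "query": {"description": "Query to search the internet with", "type": "str", "required": True}
        },
    }
]

# Build the tool-use prompt, then generate as usual
prompt = tokenizer.apply_tool_use_template(
    conversation, tools=tools, tokenize=False, add_generation_prompt=True
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))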

@Blaizzy
Contributor

Blaizzy commented Apr 10, 2024

@awni you can use the example in the MLX model card to replicate @fblissjr's example:

https://huggingface.co/mlx-community/c4ai-command-r-plus-4bit

@Blaizzy
Contributor

Blaizzy commented Apr 10, 2024

The nn.RoPE layer does not make any assumptions about the maximum sequence length.
My understanding is llama.cpp uses a fixed-size context (the -c flag), so while it works for long context it does not really "work" in the sense that it does not use any context before a certain limit. It simply truncates everything before the most recent -c tokens. The default value for -c is 512:

  -c N, --ctx-size N    size of the prompt context (default: 512, 0 = loaded from model)

Maybe we could provide something similar.. but I think the default behavior is a little misleading.

I see now.

I thought that because the PyTorch implementation takes the context window size into account, we were missing something.

Something like this:

class RotaryPositionalEmbeddings(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.device=device
        self.scaling_factor = scaling_factor
        self.base = base
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
        t = t / self.scaling_factor
        freqs = torch.outer(t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)

    @torch.no_grad()
    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:seq_len].to(dtype=x.dtype),
            self.sin_cached[:seq_len].to(dtype=x.dtype),
        )
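
For comparison, a minimal MLX-side sketch (mine, not from mlx_lm) showing there is no precomputed cos/sin cache in nn.RoPE; positions come from the sequence length plus the offset passed at call time:

import mlx.core as mx
import mlx.nn as nn

rope = nn.RoPE(dims=128, base=10000)

# Prompt processing: positions 0..4095
x = mx.random.normal((1, 8, 4096, 128))
y = rope(x)

# Incremental decoding far past any max_position_embeddings: a single token at position 100000
step = mx.random.normal((1, 8, 1, 128))
y_step = rope(step, offset=100000)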

@Blaizzy
Contributor

Blaizzy commented Apr 10, 2024

The nn.RoPE layer does not make any assumptions about the maximum sequence length.

Btw, does that mean we can use a context of any length with nn.RoPE? If not, what are the limitations?

@M-I

M-I commented May 6, 2024

Hi folks, is there still no consensus on what settings to use and which JSON files to pick?

Using the mlx-community 4-bit version, I get random Japanese characters. What surprised me was that, without my even mentioning it in the prompt, at some point the model acknowledged and apologized for their randomness and said it would try to avoid them.

@Blaizzy
Contributor

Blaizzy commented May 6, 2024

@M-I could you elaborate on what you mean?

@M-I

M-I commented May 6, 2024

starting with

from mlx_lm import load, generate
model, tokenizer = load("mlx-community/c4ai-command-r-plus-4bit")

at some point, generate(model, tokenizer, prompt=tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True), verbose=True, max_tokens=1000) generated a bit where there was:
"Let's start with defining the specific requirements and breaking down the project into manageable tasks. Feel free to share your thoughts and questions along the way! I'm excited to see your project come to life! 😊👍

P.S. Apologies for the random Japanese words (e.g., "グリーニング") that appeared in my response. It seems there might be an issue with my language model. I'll try to avoid this in future responses. 😅👍. I'm always ready to assist you with your project! 😊".

And it will go on and on, as if the end token is never generated or acknowledged. I just assumed it was the price to pay for 4-bit quantization, so I never mentioned the fact that there was Japanese, or anything weird or out of place in its response, but it just self-reflected on its own.
I'll try to find a prompt that I can share that creates similar responses. But others in the discussion tab of the 4-bit MLX model on Hugging Face noticed some weirdness.

@Blaizzy
Contributor

Blaizzy commented May 6, 2024

I see, thank you for explaining :)

I think this should be a new issue, as it's not related to this thread.
