No Chinese visible in the vocabulary #323

Open
SidneyLann opened this issue Apr 25, 2024 · 4 comments

Comments

@SidneyLann

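import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public static void utf8ToGbk() throws Exception {
    String fileName = "c:/tokenizer.json";
    // Read the file as UTF-8.
    List<String> lines = Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8);
    // Let the stream encode output as GBK for a GBK console, rather than
    // round-tripping each string through getBytes("GBK") and the default charset.
    PrintStream out = new PrintStream(System.out, true, "GBK");
    for (String sentence : lines) {
        out.println(sentence);
    }
}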

Even this way I can't see any Chinese. What do I need to do to see the Chinese tokens in the vocabulary?

@ZHangZHengEric
Collaborator

That's not how you inspect it.

@SidneyLann
Author

SidneyLann commented Apr 27, 2024

> That's not how you inspect it.

My text editor is already set to UTF-8 and I still can't see any Chinese. How can I see it?

@ZHangZHengEric
Collaborator

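> > That's not how you inspect it.
>
> My text editor is already set to UTF-8 and I still can't see any Chinese. How can I see it?

I suggest reading up on how the llama3 tokenizer works. You most likely can't read Chinese directly from the file: the Chinese text has all been split apart.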
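To illustrate what "split apart" could mean here, below is a minimal sketch that assumes tokenizer.json uses the GPT-2 style byte-level BPE convention, where each raw UTF-8 byte is written as a printable stand-in character. The thread does not confirm this for this particular file, and `ByteLevelDecoder` and `decode` are hypothetical names introduced only for this example. Under that assumption, the two characters 中文 would appear in the vocabulary as `ä¸Ń` and `æĸĩ`, which is why no Chinese is visible in any editor encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: reverses the GPT-2 style byte-to-unicode table.
class ByteLevelDecoder {
    // Maps each stand-in code point back to the raw byte value it represents.
    private static final Map<Integer, Integer> CHAR_TO_BYTE = buildTable();

    private static Map<Integer, Integer> buildTable() {
        Map<Integer, Integer> table = new HashMap<>();
        int next = 0;
        for (int b = 0; b < 256; b++) {
            boolean printable = (b >= '!' && b <= '~')
                    || (b >= 0xA1 && b <= 0xAC)
                    || (b >= 0xAE && b <= 0xFF);
            if (printable) {
                table.put(b, b);             // printable bytes stand for themselves
            } else {
                table.put(256 + next++, b);  // the rest are shifted to U+0100 and up
            }
        }
        return table;
    }

    // Turns a vocabulary entry back into the text its bytes encode.
    static String decode(String token) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        token.codePoints().forEach(cp -> {
            Integer b = CHAR_TO_BYTE.get(cp);
            if (b == null) {
                throw new IllegalArgumentException("Not a byte-level character: U+"
                        + Integer.toHexString(cp));
            }
            bytes.write(b);
        });
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ä¸Ń" + "æĸĩ" written with escapes; these are the bytes E4 B8 AD E6 96 87.
        String token = "\u00e4\u00b8\u0143\u00e6\u0138\u0129";
        System.out.println(decode(token));
    }
}
```

A single vocabulary entry may hold only part of a multi-byte character, in which case decoding it alone yields the replacement character U+FFFD. Only the concatenation of the tokens for a whole character decodes cleanly.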

@SidneyLann
Author

image

There is very little llama3 code, and I can't tell from it how Chinese is read. How was it trained?
