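import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public static void utf8ToGbk() throws Exception {
    String fileName = "c:/tokenizer.json";
    // Read the whole file as UTF-8 text, one entry per line.
    List<String> lines = Files.readAllLines(Paths.get(fileName), Charset.forName("utf-8"));
    String sentence = null;
    int size = lines.size();
    for (int i = 0; i < size; i++) {
        sentence = lines.get(i);
        //System.out.println(sentence);
        // Encode the line as GBK bytes, then decode them with the platform default charset.
        System.out.println(new String(sentence.getBytes("GBK")));
    }
}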
Even with this I can't see any Chinese. What do I need to do to see the Chinese tokens in the vocabulary?
That's not how you view it.
My text editor is already set to UTF-8 and I still can't see them. How can I view them?
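I suggest reading up on how the llama3 tokenizer works. You probably can't read Chinese directly out of that file, because the Chinese characters have all been split apart.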
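To illustrate what "split apart" can mean in practice, here is a minimal sketch. It assumes the `tokenizer.json` in question stores its vocabulary with the GPT-2 style byte-to-unicode mapping that byte-level BPE tokenizers commonly use; that is an assumption about this file, not something confirmed in this thread. Under that mapping each UTF-8 byte of a character is written as one printable stand-in character, so 中 (bytes `E4 B8 AD`) would appear as `ä¸Ń` rather than as Chinese. The class and method names below (`ByteLevelTokenDecoder`, `decodeToken`) are made up for this example.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class ByteLevelTokenDecoder {

    // Inverse of the GPT-2 "bytes_to_unicode" table: stand-in code point -> original byte.
    // Assumption: the vocabulary file uses this mapping.
    private static final Map<Integer, Integer> UNICODE_TO_BYTE = buildUnicodeToByte();

    private static Map<Integer, Integer> buildUnicodeToByte() {
        Map<Integer, Integer> table = new HashMap<>();
        int extra = 0;
        for (int b = 0; b < 256; b++) {
            boolean keptAsIs = (b >= 0x21 && b <= 0x7E)
                    || (b >= 0xA1 && b <= 0xAC)
                    || (b >= 0xAE && b <= 0xFF);
            if (keptAsIs) {
                // Printable bytes are represented by the code point with the same value.
                table.put(b, b);
            } else {
                // All other bytes are shifted to code points starting at 256, in byte order.
                table.put(256 + extra, b);
                extra++;
            }
        }
        return table;
    }

    // Turns a vocabulary entry back into raw bytes and decodes those bytes as UTF-8.
    static String decodeToken(String token) {
        int[] codePoints = token.codePoints().toArray();
        byte[] bytes = new byte[codePoints.length];
        for (int i = 0; i < codePoints.length; i++) {
            Integer b = UNICODE_TO_BYTE.get(codePoints[i]);
            if (b == null) {
                throw new IllegalArgumentException("Not a byte-level character: U+"
                        + Integer.toHexString(codePoints[i]));
            }
            bytes[i] = (byte) b.intValue();
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Print through an explicit UTF-8 stream so the console charset does not interfere.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        // "\u00e4\u00b8\u0143" is how the three UTF-8 bytes of U+4E2D look under the mapping.
        out.println(decodeToken("\u00e4\u00b8\u0143"));
    }
}
```

If the assumption holds, a single vocabulary entry may also hold only part of a character's bytes, in which case decoding that entry on its own yields replacement characters instead of readable text; whole characters only appear once the bytes of adjacent tokens are joined.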
The llama3 code is very short, and I can't tell from it how Chinese is read. How is it trained?