No Chinese visible in the vocabulary #323

SidneyLann · 2024-04-25T02:03:22Z

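import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public static void utf8ToGbk() throws Exception {
    String fileName = "c:/tokenizer.json";
    // Read the tokenizer file as UTF-8 text, one entry per line.
    List<String> lines = Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8);
    for (String sentence : lines) {
        //System.out.println(sentence);
        // Attempted conversion: encode to GBK bytes, then decode with the platform default charset.
        System.out.println(new String(sentence.getBytes("GBK")));
    }
}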

Even with this I can't see any Chinese. What do I need to do to see the Chinese tokens in the vocabulary?

ZHangZHengEric · 2024-04-26T07:21:36Z

That's not how you view it.

SidneyLann · 2024-04-27T09:07:59Z

> That's not how you view it.

My text editor is already set to UTF-8 and I still can't see any Chinese. How can I see it?

ZHangZHengEric · 2024-04-27T10:06:13Z

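> > That's not how you view it.
>
> My text editor is already set to UTF-8 and I still can't see any Chinese. How can I see it?

I suggest reading up on how the llama3 tokenizer works. You most likely can't read Chinese directly from the vocabulary, because the Chinese characters have all been broken apart.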
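A minimal sketch of what "broken apart" can look like in practice, assuming tokenizer.json uses byte-level BPE with the GPT-2 style byte-to-character table (an assumption, not something confirmed in this thread). Under that scheme each UTF-8 byte of a token is stored as a printable stand-in character, so Chinese never appears literally in the file. The class and method names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: converts between readable text and the byte-level
// form used in a GPT-2 style vocabulary (assumed, not taken from the repo).
final class ByteLevelVocabDemo {
    // BYTE_TO_CHAR[b] is the printable stand-in character for byte value b.
    private static final char[] BYTE_TO_CHAR = new char[256];
    // CHAR_TO_BYTE[c] is the byte value for stand-in character c, or -1.
    private static final int[] CHAR_TO_BYTE = new int[512];

    static {
        Arrays.fill(CHAR_TO_BYTE, -1);
        int next = 0;
        for (int b = 0; b < 256; b++) {
            // Bytes in these ranges map to themselves; the rest are
            // assigned code points 256, 257, ... in increasing byte order.
            boolean keep = (b >= 0x21 && b <= 0x7E)
                    || (b >= 0xA1 && b <= 0xAC)
                    || (b >= 0xAE && b <= 0xFF);
            char c = keep ? (char) b : (char) (256 + next++);
            BYTE_TO_CHAR[b] = c;
            CHAR_TO_BYTE[c] = b;
        }
    }

    // Turn a vocabulary entry back into text by recovering its raw bytes
    // and decoding them as UTF-8.
    static String decodeToken(String token) {
        byte[] bytes = new byte[token.length()];
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c >= CHAR_TO_BYTE.length || CHAR_TO_BYTE[c] < 0) {
                throw new IllegalArgumentException("not a byte-level character: " + c);
            }
            bytes[i] = (byte) CHAR_TO_BYTE[c];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Turn text into the form it would take inside the vocabulary,
    // which is useful for searching the file for a given word.
    static String encodeText(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(BYTE_TO_CHAR[b & 0xFF]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u4f60\u597d" is the two-character Chinese greeting "ni hao".
        String stored = encodeText("\u4f60\u597d");
        System.out.println(stored);
        System.out.println(decodeToken(stored));
    }
}
```

With this table, the two characters of the greeting become six stand-in characters, one per UTF-8 byte. A single vocabulary entry may also hold only part of a character's bytes, in which case decoding that entry alone yields replacement characters rather than readable Chinese.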

SidneyLann · 2024-04-27T12:55:53Z

There is very little llama3 code, and I can't tell from it how Chinese is read or how the training is done.
