
How can we use model BertForTokenClassification for lengthy sentences? #77

Open

Swty13 opened this issue Mar 20, 2020 · 4 comments

@Swty13 commented Mar 20, 2020

Hi,
As BERT tokenization only supports sequences of up to 512 tokens, how can I proceed if my text is longer than that?
I used BertForTokenClassification for an entity recognition task, but because my text is long I get the warning: "Token indices sequence length is longer than the specified maximum sequence length for this BERT model (527 > 512). Running this sequence through BERT will result in indexing errors".
I don't want to trim or truncate my text, since that would lose important information; I have to pass the whole text.
Could you please suggest what I should do, or do you have any other ideas for implementing named entity recognition?

Thanks in advance.

@kowshik226

Hi @Swty13,

The pretrained model's positional embedding size is set to 512; you can see this in the following code:
pytorch_transformers/tokenization_bert.py, line 50: PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {'bert-base-uncased': 512}

Try changing the line below in run_ner.py (line 216) and bert.py (line 79):
input_ids = tokenizer.convert_tokens_to_ids(ntokens) to input_ids = tokenizer.tokenize(ntokens)
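As a side note, here is a minimal sketch (not from this repo) of checking a tokenized text against that 512-position limit before inference; the model name and input string are placeholders:

from pytorch_transformers import BertTokenizer

# Minimal sketch: detect sequences that exceed BERT's positional limit.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # placeholder model
long_text = "Some very long document ... " * 100  # placeholder input
tokens = tokenizer.tokenize(long_text)
limit = 512 - 2  # reserve room for the [CLS] and [SEP] special tokens
if len(tokens) > limit:
    print("%d tokens > %d: split the text before running the model." % (len(tokens), limit))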

I am having the same issue; I will try to find a solution.

If you find a solution, please let me know.

Best
kowshik

@Swty13 (Author) commented Apr 21, 2020

@kowshik226

Hi,
Thanks for replying, I will definitely try this.
Could you please help me with what kind of hardware configuration is required to train a custom NER model on a dataset of 1 lakh (100,000) examples?

(Currently I have VM servers with 32 GB and 64 GB of RAM. What configuration should I choose, and is a GPU a must for training a BERT model? I am new to BERT, so I have no idea about it.)

Thanks

@ranjeetds

@Swty13
For the first question:

  1. Cut your input into sections of 512 tokens and pass them iteratively for inference. (I have implemented this myself.)

Cons:
You might lose some context because the sequence is cut at an arbitrary position; overlapping windows (sketched below) are a common mitigation.
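As an illustration of that mitigation, here is a minimal sketch of chunking with overlapping windows; the window size and the stride of 128 are assumptions, and merging the per-window predictions is left to the caller:

def chunk_with_overlap(tokens, max_len=510, stride=128):
    """Split a token list into overlapping windows so that tokens near a
    chunk boundary still appear with surrounding context in the next window.
    max_len is 510 to leave room for the [CLS] and [SEP] special tokens."""
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - stride  # step forward, keeping `stride` tokens of overlap
    return chunks

# Example: three overlapping windows over a 1200-token sequence.
windows = chunk_with_overlap(list(range(1200)))
print([len(w) for w in windows])  # [510, 510, 436]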

For your second question:

You can always train your model on a CPU with a 1 lakh dataset (I am assuming sentences) given the RAM you mention.

@raff7 commented May 29, 2020

There is no way to do it without splitting the input. Google released the pretrained version of BERT with the 512-token limitation, and to remove it you would basically have to pretrain BERT from scratch, which is infeasible and would cost a lot of money.

I solved it by splitting the input into chunks of fewer than 500 tokens each, always splitting at the closest period.

# Imports assumed from this repo: Ner is the wrapper defined in bert.py,
# and BertTokenizer comes from pytorch_transformers.
from bert import Ner
from pytorch_transformers import BertTokenizer

def predict(ex):
    modelDir = "./NamedEntityExtraction/bert/out_base/"
    model = Ner(modelDir)
    tokenizer = BertTokenizer.from_pretrained(modelDir, do_lower_case=False)
    tokens = tokenizer.tokenize(ex)
    splitEx = []
    while len(tokens) > 500:
        if '.' in tokens[:500]:
            # Split just after the last period within the first 500 tokens.
            idx = 500 - tokens[:500][::-1].index('.')
        else:
            idx = 500  # no period found: fall back to a hard cut
        splitEx.append(tokenizer.convert_tokens_to_string(tokens[:idx]))
        tokens = tokens[idx:]

    splitEx.append(tokenizer.convert_tokens_to_string(tokens))
    output = []
    for e in splitEx:
        output.extend(model.predict(e))
    return output

Edit: I use 500 instead of 512 just because when I use 512 I sometimes still get an error, possibly because of additional tokens added by the model itself.
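For what it's worth, those extra tokens are most likely the [CLS] and [SEP] special tokens that BERT adds around each sequence, so the 500-token margin is a safe choice. A hypothetical call to the helper above (the input string is a placeholder):

# Hypothetical usage of predict(); assumes a trained model in modelDir above.
long_text = "First sentence. Second sentence. " * 400  # placeholder input
entities = predict(long_text)
print(entities[:10])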
