
How can we use model BertForTokenClassification for lengthy sentences? #77

Open

Swty13 opened this issue Mar 20, 2020 · 4 comments

@Swty13 commented Mar 20, 2020

Hi,
As BERT tokenization only supports sequences of up to 512 tokens, how can I proceed if my text is longer than that?
I used BertForTokenClassification for an entity recognition task, but because my text is long I get the warning: "Token indices sequence length is longer than the specified maximum sequence length for this BERT model (527 > 512). Running this sequence through BERT will result in indexing errors".
I don't want to trim or truncate my text, since that would lose important information; I have to pass the whole text.
Could you please suggest what I should do, or do you have any other ideas for implementing named entity recognition?

Thanks in advance.

@kowshik226

Hi @Swty13,

The pretrained model's positional embedding size is set to 512; you can see this in the following code:
pytorch_transformers/tokenization_bert.py, line 50: PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {'bert-base-uncased': 512}

Try changing the line below in run_ner.py (line 216) and bert.py (line 79):
input_ids = tokenizer.convert_tokens_to_ids(ntokens) to input_ids = tokenizer.tokenize(ntokens)
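As a side note, here is a minimal sketch (not from this repo) of checking a tokenized text against that 512-position limit before inference; the model name and input string are placeholders:

from pytorch_transformers import BertTokenizer

# Minimal sketch: detect sequences that exceed BERT's positional limit.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # placeholder model
long_text = "Some very long document ... " * 100  # placeholder input
tokens = tokenizer.tokenize(long_text)
limit = 512 - 2  # reserve room for the [CLS] and [SEP] special tokens
if len(tokens) > limit:
    print("%d tokens > %d: split the text before running the model." % (len(tokens), limit))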

I am having the same issue; I will try to find a solution.

If you find a solution, please let me know.

Best
kowshik

@Swty13 (Author) commented Apr 21, 2020

@kowshik226

Hi,
Thanks for replying, I will definitely try this.
Could you please help me with what kind of hardware configuration is required to train a custom NER model on a dataset of 1 lakh (100,000) examples?

(Currently I have VM servers with 32 GB and 64 GB of RAM. What configuration should I choose, and is a GPU a must for training a BERT model? I am new to BERT, so I have no idea about it.)

Thanks

@ranjeetds

@Swty13
For the first question:

  1. Cut your input into sections of 512 tokens and pass them iteratively for inference. (I have implemented this myself.)

Cons:
You might lose some context because the sequence is cut at an arbitrary position; overlapping windows (sketched below) are a common mitigation.
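As an illustration of that mitigation, here is a minimal sketch of chunking with overlapping windows; the window size and the stride of 128 are assumptions, and merging the per-window predictions is left to the caller:

def chunk_with_overlap(tokens, max_len=510, stride=128):
    """Split a token list into overlapping windows so that tokens near a
    chunk boundary still appear with surrounding context in the next window.
    max_len is 510 to leave room for the [CLS] and [SEP] special tokens."""
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - stride  # step forward, keeping `stride` tokens of overlap
    return chunks

# Example: three overlapping windows over a 1200-token sequence.
windows = chunk_with_overlap(list(range(1200)))
print([len(w) for w in windows])  # [510, 510, 436]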

For your second question:

You can always train your model on a CPU with a 1 lakh dataset (I am assuming sentences) given the RAM you mention.

@raff7 commented May 29, 2020

There is no way to do it without splitting the input. Google released the pretrained version of BERT with the 512-token limitation, and to remove it you would basically have to pretrain BERT from scratch, which is infeasible and would cost a lot of money.

I solved it by splitting the input into chunks of fewer than 500 tokens each, always splitting at the closest period.

# Imports assumed from this repo: Ner is the wrapper defined in bert.py,
# and BertTokenizer comes from pytorch_transformers.
from bert import Ner
from pytorch_transformers import BertTokenizer

def predict(ex):
    modelDir = "./NamedEntityExtraction/bert/out_base/"
    model = Ner(modelDir)
    tokenizer = BertTokenizer.from_pretrained(modelDir, do_lower_case=False)
    tokens = tokenizer.tokenize(ex)
    splitEx = []
    while len(tokens) > 500:
        if '.' in tokens[:500]:
            # Split just after the last period within the first 500 tokens.
            idx = 500 - tokens[:500][::-1].index('.')
        else:
            idx = 500  # no period found: fall back to a hard cut
        splitEx.append(tokenizer.convert_tokens_to_string(tokens[:idx]))
        tokens = tokens[idx:]

    splitEx.append(tokenizer.convert_tokens_to_string(tokens))
    output = []
    for e in splitEx:
        output.extend(model.predict(e))
    return output

Edit: I use 500 instead of 512 just because when I use 512 I sometimes still get an error, possibly because of additional tokens added by the model itself.
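For what it's worth, those extra tokens are most likely the [CLS] and [SEP] special tokens that BERT adds around each sequence, so the 500-token margin is a safe choice. A hypothetical call to the helper above (the input string is a placeholder):

# Hypothetical usage of predict(); assumes a trained model in modelDir above.
long_text = "First sentence. Second sentence. " * 400  # placeholder input
entities = predict(long_text)
print(entities[:10])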
