I used OCR to extract text from images, and I want to correct that text using a deep learning algorithm. I have a dataset containing files with the erroneous text and corresponding files with the correct text. I want to train a model on this data so that, when I give it text with some wrong letters, it predicts the correct text.

Can you suggest some algorithms I could use for this problem, and explain how to use them?

Would BERT work for this case?

https://preview.redd.it/5w2ikho1t8ca1.png?width=446&format=png&auto=webp&v=enabled&s=deee6f71fd819a1ee8a4c31669829bac0bc16b4f

Comments

shmollerup t1_j4htyot wrote on January 15, 2023 at 8:52 PM

You could try something that works at the character level, like a sequence-to-sequence model, or maybe an RNN approach like char2vec. Both approaches should work pretty well if you have enough training data.
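To make the character-level idea concrete, here is a minimal sketch of the data preparation such a model would need: turning each (OCR text, correct text) pair into fixed-length sequences of character IDs. The helper names (`build_vocab`, `encode`, `decode`), the special tokens, and the sample pair are all made up for illustration; the model itself (an encoder-decoder in whatever framework you prefer) is not shown.

```python
# Hypothetical special tokens; any names work as long as they are not single characters.
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"


def build_vocab(texts):
    """Map every character seen in `texts` (plus the special tokens) to an integer ID."""
    chars = sorted({ch for text in texts for ch in text})
    tokens = [PAD, SOS, EOS, UNK] + chars
    return {tok: i for i, tok in enumerate(tokens)}


def encode(text, stoi, max_len):
    """Wrap text in <sos>/<eos>, map characters to IDs, then truncate or pad to max_len.

    Note: if the text is longer than max_len - 2, truncation drops the <eos> token.
    """
    ids = [stoi[SOS]] + [stoi.get(ch, stoi[UNK]) for ch in text] + [stoi[EOS]]
    ids = ids[:max_len]
    return ids + [stoi[PAD]] * (max_len - len(ids))


def decode(ids, stoi):
    """Turn IDs back into a string, stopping at <eos> and skipping <pad>/<sos>."""
    itos = {i: tok for tok, i in stoi.items()}
    out = []
    for i in ids:
        tok = itos[i]
        if tok == EOS:
            break
        if tok in (PAD, SOS):
            continue
        out.append(tok)
    return "".join(out)


# Made-up example pair: (OCR output, ground truth).
pairs = [("he1lo w0rld", "hello world")]
stoi = build_vocab([text for pair in pairs for text in pair])
source_ids = encode(pairs[0][0], stoi, max_len=16)  # model input
target_ids = encode(pairs[0][1], stoi, max_len=16)  # training target
```

Building one shared vocabulary over both the noisy and the clean files keeps the sketch simple; with a large dataset you would build it once and save it alongside the model.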

thatoneboii t1_j4jf8h5 wrote on January 16, 2023 at 3:23 AM

Do you absolutely need to use deep learning? There are tons of much faster autocorrect implementations that use Levenshtein distance and other non-DL techniques, such as SymSpell or Norvig’s algorithm. DL is complicated, expensive, and requires tons of data to train on; I would stay away from it unless you’re doing it for your own enrichment or a school project.
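For a sense of what the non-DL route looks like, here is a rough sketch of dictionary lookup by Levenshtein distance. It is not SymSpell or Norvig's algorithm, both of which are considerably faster; it is just a brute-force version of the underlying idea. The `correct_word` helper, the `max_distance` threshold, and the tiny vocabulary are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance between a and b; insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def correct_word(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or `word` unchanged if nothing is close enough."""
    if not vocabulary or word in vocabulary:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda v: (levenshtein(word, v), v))
    return best if levenshtein(word, best) <= max_distance else word


# Made-up vocabulary; in practice this would come from the correct-text files.
vocabulary = {"hello", "world"}
fixed = " ".join(correct_word(w, vocabulary) for w in "he1lo w0rld".split())
```

This scans the whole vocabulary for every word, so it only suits small dictionaries, and it ignores context entirely, which may be part of why plain Levenshtein matching falls short on OCR output.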

Legitimate-Gold-8711 OP t1_j4kfxo9 wrote on January 16, 2023 at 9:40 AM

I tried the Levenshtein distance algorithm, but it doesn't work the way I want.

SupremeChampionOfDi t1_j4l4r6u wrote on January 16, 2023 at 2:16 PM

I read this in a funny Chinese accent for some reason.

Legitimate-Gold-8711 OP t1_j4l9swg wrote on January 16, 2023 at 2:54 PM

Which reasons :D

SupremeChampionOfDi t1_j4lwzff wrote on January 16, 2023 at 5:26 PM

This is how a Chinese person I know speaks.