SupplyChainNext t1_j2lik4g wrote
Reply to [D] Data cleaning techniques for PDF documents with semantically meaningful parts by cm_34978
As someone who’s done this extensively - unless it was made in Word, the PDF can be complete gibberish or have massive paragraph / sentence errors. Heck - the OCR can outright misread words, or take a 3-letter word and turn it into 2 letters with 15 random ASCII characters in between.
It’s a crapshoot.
Godspeed.
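For what it's worth, here is a rough sketch of one way to at least flag the "random ASCII characters in the middle of a word" kind of damage after extraction. It is only a heuristic under my own assumptions: the function names `looks_garbled` and `flag_garbled_tokens` and the 0.4 threshold are made up for illustration, not taken from any library, and it will not catch plain misread words.

```python
import string


def looks_garbled(token: str, max_symbol_ratio: float = 0.4) -> bool:
    """Heuristic (assumed threshold): too many non-alphanumeric chars inside a token."""
    # Ignore ordinary leading/trailing punctuation such as "word." or "(word)".
    core = token.strip(string.punctuation)
    if not core:
        # Pure punctuation like "-" or "/" is not treated as OCR noise here.
        return False
    symbols = sum(1 for ch in core if not ch.isalnum())
    return symbols / len(core) > max_symbol_ratio


def flag_garbled_tokens(text: str) -> list[str]:
    """Return the whitespace-separated tokens that look like OCR debris."""
    return [tok for tok in text.split() if looks_garbled(tok)]


if __name__ == "__main__":
    sample = "The cat t#@!%^&*he sat - on a mat."
    print(flag_garbled_tokens(sample))
```

Anything it flags still needs a human look or a dictionary check; the threshold is a guess and would need tuning per document set.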