Is there a simple way to identify if a pdf is scanned?























I have thousands of documents, and some of them are scanned. I need a script that tests all PDF files in a directory. Is there a simple way to do that?




  1. Most of the PDFs are reports, so they contain a lot of text.


  2. The files differ a lot from one another, but even in the scanned ones (see the samples below) some text can be found, because an unreliable OCR process was coupled with the scan.




    • NotScanned

    • Scanned1

    • Scanned2



  3. The proposal by sudodus in the comments below seems very interesting. Look at the difference between a scanned and a non-scanned PDF:



Scanned:



grep --color -a 'Image' AR-G1002.pdf
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 340615/Name/Obj13/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 472681/Name/Obj38/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 469340/Name/Obj43/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 371863/Name/Obj48/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 344092/Name/Obj53/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 292932/Name/Obj103/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>>stream
<rdf:li xml:lang="x-default">Image</rdf:li>
<rdf:li xml:lang="x-default">Image</rdf:li>


Not Scanned:



grep --color -a 'Image' AR-G1003.pdf
<</Lang(en-US)/MarkInfo<</Marked true>>/Metadata 167 0 R/Pages 2 0 R/StructTreeR<</Contents 4 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F4 11 0 R/F5 13 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/StructParents 0/Tabs/S/Type/<</Filter/FlateDecode/Length 5463>>stream
<</BaseFont/Times#20New#20Roman,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontD<</Ascent 891/AvgWidth 427/CapHeight 677/Descent -216/Flags 32/FontBBox[-558 -216 2000 677]/FontName/Times#20New#20Roman,Bold/FontWeight 700/ItalicAngle 0/Leadi<</BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FirstChar 32/FontDescri<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontName/Times#20New#20Roman/FontWeight 400/ItalicAngle 0/Leading 42<</BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 10 0<</Ascent 905/AvgWidth 479/CapHeight 728/Descent -210/Flags 32/FontBBox[-628 -210 2000 728]/FontName/Arial,Bold/FontWeight 700/ItalicAngle 0/Leading 33/MaxWidth<</BaseFont/Times#20New#20Roman,Italic/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 12 0 R/LastChar 118/Name/F4/Subtype/TrueType/Type/Font/Widths 164 0 <</Ascent 891/AvgWidth 402/CapHeight 694/Descent -216/Flags 32/FontBBox[-498 -216 1333 694]/FontName/Times#20New#20Roman,Italic/FontWeight 400/ItalicAngle -16.4<</BaseFont/Arial/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 14 0 R/La<</Ascent 905/AvgWidth 441/CapHeight 728/Descent -210/Flags 32/FontBBox[-665 -210 2000 728]/FontName/Arial/FontWeight 400/ItalicAngle 0/Leading 33/MaxWidth 2665<</Contents 16 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 7534>>streamarents 1/Tabs/S/Type/Page>>
<</Contents 18 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 6137>>streamarents 2/Tabs/S/Type/Page>>
<</Contents 20 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R/F6 21 0 R><</Filter/FlateDecode/Length 6533>>stream>>/StructParents 3/Tabs/S/Type/Page>>
<</BaseFont/Times#20New#20Roman/DescendantFonts 22 0 R/Encoding/Identity-H/Subty<</BaseFont/Times#20New#20Roman/CIDSystemInfo 24 0 R/CIDToGIDMap/Identity/DW 100<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontFile2 160 0 R/FontName/Times#20New#20Roman/FontWeight 400/Italic<</Contents 27 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</ExtGState<</GS28 28 0 R/GS29 29 0 R>>/Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F5 13 0 R/F6 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC<</Filter/FlateDecode/Length 5369>>streamge>>


The scanned file contains far more images (about one per page)!
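The observation above (roughly one image object per page in the scanned file, none in the other) can be turned into a rough check. The following is only a sketch under stated assumptions: the function names are made up for illustration, the "at least one image per page" threshold is a guess rather than a tested rule, and the counts rely on the PDF dictionaries being readable in the raw file, as they are in the grep output shown here. The pattern is narrowed to `/Subtype/Image` because plain `Image` also matches the `/ImageB` entries in the non-scanned output.

```shell
#!/bin/sh
# Heuristic sketch: compare image objects to page objects in the raw PDF bytes.
# Assumption: object dictionaries are stored uncompressed, as in the grep
# output above. PDFs that use compressed object streams will be undercounted.

# Count occurrences of a pattern in a file (matches, not lines).
count_matches() {
    # grep exits with status 1 when nothing matches; "|| true" keeps that
    # from being treated as an error. -a reads binary data as text and -o
    # prints every match on its own line, so wc -l counts matches.
    { LC_ALL=C grep -a -o -E "$1" "$2" || true; } | wc -l | tr -d ' '
}

# Image XObjects are declared with /Subtype/Image.
count_images() { count_matches '/Subtype */Image' "$1"; }

# /Type/Page must not be followed by "s" (that would be the /Type/Pages node).
count_pages() { count_matches '/Type */Page([^s]|$)' "$1"; }

# Hypothetical rule: at least one image per page means "scanned".
classify_pdf() {
    images=$(count_images "$1")
    pages=$(count_pages "$1")
    if [ "$pages" -gt 0 ] && [ "$images" -ge "$pages" ]; then
        echo "scanned"
    else
        echo "not-scanned"
    fi
}
```

To test every file in a directory, something like `for f in /path/to/dir/*.pdf; do printf '%s\t%s\n' "$f" "$(classify_pdf "$f")"; done` should work. Text reports with one illustration per page would be misclassified, so the threshold may need tuning against known samples.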






























  • 7




    Do you mean whether they're text or images?
    – DK Bose
    yesterday






  • 8




    Why do you want to know whether a PDF file is scanned or not? How do you intend to use that information?
    – sudodus
    yesterday






  • 3




    @sudodus asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you distinguish between such files and text files? Do you know the source of your PDFs?
    – pipe
    yesterday






  • 1




    Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
    – dessert
    yesterday






  • 1




    If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/, which can be found with the command line grep --color -a 'Image' filename.pdf. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).
    – sudodus
    yesterday
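The grep idea in the comment above can be applied to a whole directory. This is a minimal sketch, not a definitive test: `list_pdfs_with_images` is a made-up name, and the pattern is narrowed from `Image` to `/Subtype/Image` on the assumption that image objects are declared that way in the raw file, as in the scanned sample's output. Plain `Image` would also match `/ImageB` and similar `ProcSet` entries in text-only files.

```shell
#!/bin/sh
# List the PDFs under a directory that declare at least one image object.
# Assumption: the /Subtype/Image key is readable in the raw file; PDFs that
# store their objects in compressed streams will be missed.
list_pdfs_with_images() {
    dir=${1:-.}
    # File names containing newlines are not handled by this simple loop.
    find "$dir" -type f -name '*.pdf' | while IFS= read -r f; do
        # -a reads binary data as text, -q only reports whether a match exists.
        if LC_ALL=C grep -a -q '/Subtype */Image' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling `list_pdfs_with_images /path/to/dir` prints one matching file per line. As the comment notes, this separates files with any image from files without, so a text report with a small logo is listed as well.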






















































command-line pdf














edited yesterday









muru











asked yesterday









DanielTheRocketMan









  • 7




    Do you mean whether they're text or images?
    – DK Bose
    yesterday






  • 8




    Why do you want to know, if a pdf file is scanned or not? How do you intend to use that information?
    – sudodus
    yesterday






  • 3




    @sudodus Asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you make a difference between such files and text files? Do you know the source of your PDFs?
    – pipe
    yesterday






  • 1




    Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
    – dessert
    yesterday






  • 1




    If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/, which can be found with the command line grep --color -a 'Image' filename.pdf. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).
    – sudodus
    yesterday




























5 Answers

















up vote
1
down vote



accepted










Shellscript




  • If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.


  • In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).



I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.



#!/bin/bash

echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
    exit
fi

mkdir -p scanned text "s-and-t"

for file in *.pdf
do
    # Skip the literal pattern when no pdf file matches
    [ -e "$file" ] || continue

    # -a treats the binary file as text, -q only sets the exit status
    if grep -aq '/Image/' "$file"
    then
        image=true
    else
        image=false
    fi
    if grep -aq '/Text' "$file"
    then
        text=true
    else
        text=false
    fi

    if $image && $text
    then
        mv "$file" "s-and-t"
    elif $image
    then
        mv "$file" "scanned"
    elif $text
    then
        mv "$file" "text"
    else
        echo "$file undecided"
    fi
done


Make the shellscript executable,



chmod ugo+x pdf-text-or-image


Change directory to where you have the pdf files and run the shellscript.



Identified files are moved to the following subdirectories




  • scanned

  • text


  • s-and-t (for documents with both [scanned?] images and text content)


Unidentified file objects, 'UFOs', remain in the current directory.
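To preview the sorting before any file is moved, the same two `grep` tests can be wrapped in a function that only prints the category. This is a minimal sketch; `classify_pdf` is a hypothetical name, and it inherits the limits of the string test (it only sees `/Image/` and `/Text` where they appear uncompressed in the file).

```shell
#!/bin/bash
# Print the category the script above would choose, without moving the file.
classify_pdf() {
    local image=false text=false
    grep -aq '/Image/' "$1" && image=true
    grep -aq '/Text' "$1" && text=true

    if $image && $text; then
        echo "s-and-t"
    elif $image; then
        echo "scanned"
    elif $text; then
        echo "text"
    else
        echo "undecided"
    fi
}
```

A dry run over a directory could then look like `for f in *.pdf; do printf '%s\t%s\n' "$(classify_pdf "$f")" "$f"; done`.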



Test



I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some of my own pdf files (which I created using LibreOffice Impress).



$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y

$ ls -1 *
pdf-text-or-image
pdf-text-or-image0

s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf

scanned:
AR-G1002.pdf

text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf


Let us hope that




  • there are no UFOs in your set of files

  • the sorting is correct concerning text versus scanned/images





























  • instead of redirecting to /dev/null you can just use grep -q
    – phuclv
    yesterday










  • @phuclv, Thanks for the tip :-)
    – sudodus
    yesterday






  • 1




    @phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of searching through the whole file).
    – sudodus
    18 hours ago


















up vote
6
down vote














  1. Put all the .pdf files in one folder.

  2. No .txt file in that folder.

  3. In terminal change directory to that folder with cd <path to dir>

  4. Make one more directory for non scanned files. Example:


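    mkdir -p ./x
    for file in *.pdf; do
        # "-" makes pdftotext write to stdout instead of creating a .txt file
        text=$(pdftotext "$file" - 2>/dev/null | tr -d '[:space:]')
        # Non-empty extracted text: treat the file as not scanned and move it
        if [ -n "$text" ]; then mv "$file" ./x; fi
    done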


All the pdf scanned files will remain in the folder and other files will move to another folder.
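Scanned PDFs that carry an OCR text layer will pass the text test above, so a complementary signal is whether each page holds one large image. The sketch below assumes `pdfimages` from poppler-utils and the column layout of its `-list` output (page, num, type, width, height, ...); the function name and the 1000-pixel default are arbitrary choices for illustration.

```shell
#!/bin/bash
# Reads `pdfimages -list` output on stdin and prints how many distinct pages
# contain at least one image whose width and height are both >= the minimum.
pages_with_large_image() {
    awk -v min="${1:-1000}" '
        NR > 2 && $4 >= min && $5 >= min {   # skip the two header lines
            if (!($1 in seen)) { seen[$1] = 1; n++ }
        }
        END { print n + 0 }
    '
}
```

Usage would be along the lines of `pdfimages -list file.pdf | pages_with_large_image 1000`, with the result compared against the page count reported by `pdfinfo`.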





























  • this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
    – DanielTheRocketMan
    yesterday






  • 8




    Scanned PDFs often contain the OCRed text content, so I'd guess that this simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
    – Joey
    yesterday






  • 2




    Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
    – jamesqf
    yesterday






  • 1




    @DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output of file pdf-filename.pdf includes the PDF version number. I was unable to search for specific text in BR-L1411-3.pdf (BR-L1411-3.pdf: PDF document, version 1.3), but I was able to search for text in both of the other files you provided, which are versions 1.5 and 1.6, and got one or more matches. I used PDF XChange Viewer to search these files and had similar results with evince: the version 1.3 document matched nothing.
    – Elder Geek
    yesterday






  • 1




    @DanielTheRocketMan If that's the case, you might find sorting the documents by version, using the output of file, helpful in completing your project. Although I, like others it seems, am still unclear on exactly what you are attempting to accomplish.
    – Elder Geek
    yesterday




















up vote
1
down vote













Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe, either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.



I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
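As one possible starting point for the creator check mentioned above, here is a minimal sketch that reads the `Creator:` line from `pdfinfo` output. It assumes `pdfinfo` from poppler-utils and uses its plain output rather than `-meta`; the function names and the pattern passed in are hypothetical, and the combined-document caveat still applies.

```shell
#!/bin/bash
# Reads `pdfinfo` output on stdin and prints the Creator value, trimmed.
pdf_creator() {
    sed -n 's/^Creator:[[:space:]]*//p' | sed 's/[[:space:]]*$//'
}

# Succeeds if the Creator value matches the given extended regex
# (case-insensitive). The pattern is supplied by the caller.
creator_matches() {
    pdf_creator | grep -qiE -- "$1"
}
```

For example, `pdfinfo file.pdf | creator_matches 'canon'` would succeed for files whose creator field names that scanner brand.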














New contributor




ichabod is a new contributor to this site.














  • 2




    I am also trying to use Python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents in which you cannot select text yield some text when converted to txt. For instance, I am using pdfminer in Python and I can find some text in the conversion even for pdfs where the select tool does not work.
    – DanielTheRocketMan
    yesterday


















up vote
1
down vote













If this is more about detecting whether a PDF was created by scanning, rather than whether it has images instead of text, then you might need to dig into the metadata of the file, not just its content.



In general, for the files I could find on my computer and your test files, the following is true:




  • Scanned files have fewer than 1000 chars/page, whereas non-scanned ones always have more than 1000 chars/page

  • Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software

  • PDFs with "Microsoft Word" as creator are likely not scanned, as they are Word exports. But someone could scan to Word and then export to PDF; some people have very strange workflows.


I'm using Windows at the moment, so I used node.js for the following example:



const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");

// Command-line flags
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;

const debug = DEBUG ? console.error : () => { };

const IS_CREATOR_CANON = /canon/i;
// not used by the heuristic yet
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity of return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;

function evalScanned(pdfdata, textLen, textPerPage) {
    if (textPerPage < 300 && pdfdata.numpages > 1) {
        // really low number, definitely not a text pdf
        return IS_SCANNED;
    }
    // enough text: might be scanned but OCRed;
    // this is returned if no suspicion of scanning is found
    const implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
    if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
        // this is always scanned, Canon is a brand name
        return IS_SCANNED;
    }
    return implicitAssumption;
}

(async () => {
    const pdfs = (await fs.readdir(".")).filter((fname) => fname.endsWith(".pdf"));

    for (const pdffilename of pdfs) {
        try {
            debug("\n\nFILE: ", pdffilename);
            const buffer = await fs.readFile(pdffilename);
            const data = await pdf_parse(buffer);

            // make sure the fields used below exist
            if (!data.info)
                data.info = {};
            if (!data.metadata) {
                data.metadata = {
                    _metadata: {}
                };
            }

            // PDF info
            debug(data.info);
            // PDF metadata
            debug(data.metadata);
            // text length
            const textLen = data.text ? data.text.length : 0;
            const textPerPage = textLen / data.numpages;
            debug("Text length: ", textLen);
            debug("Chars per page: ", textPerPage);
            // PDF.js version
            // check https://mozilla.github.io/pdf.js/getting_started/
            debug(data.version);

            if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
                console.log(path.resolve(".", pdffilename));
            }
        }
        catch (e) {
            if (STRICT && !DEBUG) {
                console.error("Failed to evaluate " + pdffilename);
            }
            else {
                debug("Failed to evaluate " + pdffilename);
                debug(e.stack);
            }
            if (STRICT) {
                process.exit(1);
            }
        }
    }
})();


To run it, you need to have Node.js installed (should be a single command) and you also need to call:



npm install mz pdf-parse


Usage:



node howYouNamedIt.js [scanned] [debug] [strict]

- scanned show PDFs thought to be scanned (otherwise shows not scanned)
- debug shows the debug info such as metadata and error stack traces
- strict kills the program on first error


This example is not meant as a finished solution, but with the debug flag you get some insight into the meta information of a file:



FILE:  BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03'00'',
ModDate: 'D:20140709104225-03'00'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:webso-odpovedipdfBR-L1411-3-scanned.pdf


The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see if results are correct.



D:xxxxpdf>node detect_scanned.js scanned
D:xxxxpdfAR-G1002-scanned.pdf
D:xxxxpdfAR-G1002_scanned.pdf
D:xxxxpdfBR-L1411-3-scanned.pdf
D:xxxxpdfWHO_TRS_696-scanned.pdf

D:xxxxpdf>node detect_scanned.js
D:xxxxpdfAR-G1003-not-scanned.pdf
D:xxxxpdfASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:xxxxpdfMULTIMODE ABSORBER-not-scanned.pdf
D:xxxxpdfReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf


You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can also pass the output of the program to other programs; it always prints one full path per line.
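The chars-per-page threshold used above can also be approximated from the shell. This is only a sketch under stated assumptions: it expects `pdftotext` output on stdin (pdftotext ends each page with a form feed), it counts non-whitespace bytes rather than characters, and the function names and the default of 1000 are illustrative.

```shell
#!/bin/bash
# Reads pdftotext output on stdin and prints the average number of
# non-whitespace bytes per page (pages are delimited by form feeds).
chars_per_page() {
    local text pages chars
    text=$(cat)
    pages=$(printf '%s' "$text" | tr -cd '\f' | wc -c | tr -d ' ')
    chars=$(printf '%s' "$text" | tr -d '[:space:]' | wc -c | tr -d ' ')
    [ "$pages" -gt 0 ] || pages=1      # avoid dividing by zero
    echo $(( chars / pages ))
}

# Succeeds if the text on stdin falls below the given per-page threshold.
looks_scanned() {
    [ "$(chars_per_page)" -lt "${1:-1000}" ]
}
```

A call such as `pdftotext file.pdf - | looks_scanned 1000 && echo "probably scanned"` would mirror the text-length part of the heuristic, without the creator check.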



























  • Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
    – jamesqf
    13 hours ago


















up vote
0
down vote













2 ways I can think of:




  1. Using the select text tool: in a scanned PDF the text cannot be selected; a box appears instead. You can use this fact to create the script. I know there is a way in C++ with Qt, not sure about Linux though.


  2. Search for a word in the file: in a non-scanned PDF your search will work, but not in a scanned file. You just need to find some words common to all PDFs, or rather search for the letter 'e' in all the PDFs. It is the most frequent letter, so chances are you will find it in every document that has text (unless it's Gadsby).



e.g.



grep -rnw '/path/to/pdf/' -e 'e'


Use any of the text processing tools
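Because PDF content streams are usually compressed, a search for a letter is more meaningful on extracted text than on the raw file. A minimal sketch, assuming `pdftotext` from poppler-utils is available; `text_contains` is a hypothetical helper name.

```shell
#!/bin/bash
# Succeeds if the text on stdin contains the given string or pattern.
text_contains() {
    grep -q -- "$1"
}

# Usage (not run here):
#   pdftotext "report.pdf" - | text_contains e && echo "has a text layer"
```

Scanned files that went through OCR will still pass this test, so it only separates image-only PDFs from the rest.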

























  • 1




    a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
    – phuclv
    yesterday










  • @phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
    – jamesqf
    yesterday






  • 1




    @jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
    – DanielTheRocketMan
    yesterday






  • 1




    I think the OP needs to rethink/rephrase the definition of scanned in that case, or stop using Acrobat X, which takes a scanned copy and treats it as OCR text rather than as an image.
    – swapedoc
    yesterday












5 Answers
5






active

oldest

votes








5 Answers
5






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










Shellscript




  • If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.


  • In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).



I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.



#!/bin/bash

echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi

mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"

for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi


if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done


Make the shellscript executable,



chmod ugo+x pdf-text-or-image


Change directory to where you have the pdf files and run the shellscript.



Identified files are moved to the following subdirectories




  • scanned

  • text


  • s-and-t (for documents with both [scanned?] images and text content)


Unidentified file objects, 'UFOs', remain in the current directory.



Test



I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some own pdf files (that I have created using Libre Office Impress).



$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y

$ ls -1 *
pdf-text-or-image
pdf-text-or-image0

s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf

scanned:
AR-G1002.pdf

text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf


Let us hope that




  • there are no UFOs in your set of files

  • the sorting is correct concerning text versus scanned/images






share|improve this answer























  • instead of redirecting to /dev/null you can just use grep -q
    – phuclv
    yesterday










  • @phuclv, Thanks for the tip :-)
    – sudodus
    yesterday






  • 1




    @phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of seaching through the whole files).
    – sudodus
    18 hours ago















up vote
1
down vote



accepted










Shellscript




  • If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.


  • In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).



I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.



#!/bin/bash

echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi

mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"

for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi


if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done


Make the shellscript executable,



chmod ugo+x pdf-text-or-image


Change directory to where you have the pdf files and run the shellscript.



Identified files are moved to the following subdirectories




  • scanned

  • text


  • s-and-t (for documents with both [scanned?] images and text content)


Unidentified file objects, 'UFOs', remain in the current directory.



Test



I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some own pdf files (that I have created using Libre Office Impress).



$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y

$ ls -1 *
pdf-text-or-image
pdf-text-or-image0

s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf

scanned:
AR-G1002.pdf

text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf


Let us hope that




  • there are no UFOs in your set of files

  • the sorting is correct concerning text versus scanned/images






share|improve this answer























  • instead of redirecting to /dev/null you can just use grep -q
    – phuclv
    yesterday










  • @phuclv, Thanks for the tip :-)
    – sudodus
    yesterday






  • 1




    @phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of seaching through the whole files).
    – sudodus
    18 hours ago













up vote
1
down vote



accepted







up vote
1
down vote



accepted






Shellscript




  • If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.


  • In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).



I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.



#!/bin/bash

echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi

mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"

for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi


if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done


Make the shellscript executable,



chmod ugo+x pdf-text-or-image


Change directory to where you have the pdf files and run the shellscript.



Identified files are moved to the following subdirectories




  • scanned

  • text


  • s-and-t (for documents with both [scanned?] images and text content)


Unidentified file objects, 'UFOs', remain in the current directory.



Test



I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some own pdf files (that I have created using Libre Office Impress).



$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y

$ ls -1 *
pdf-text-or-image
pdf-text-or-image0

s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf

scanned:
AR-G1002.pdf

text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf


Let us hope that




  • there are no UFOs in your set of files

  • the sorting is correct concerning text versus scanned/images






share|improve this answer














Shellscript




  • If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.


  • In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).



I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.



#!/bin/bash

echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi

mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"

for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi


if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done


Make the shellscript executable,



chmod ugo+x pdf-text-or-image


Change directory to where you have the pdf files and run the shellscript.



Identified files are moved to the following subdirectories




  • scanned

  • text


  • s-and-t (for documents with both [scanned?] images and text content)


Unidentified file objects, 'UFOs', remain in the current directory.



Test



I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some own pdf files (that I have created using Libre Office Impress).



$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y

$ ls -1 *
pdf-text-or-image
pdf-text-or-image0

s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf

scanned:
AR-G1002.pdf

text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf


Let us hope that




  • there are no UFOs in your set of files

  • the sorting is correct concerning text versus scanned/images















edited 18 hours ago

























answered yesterday









sudodus













  • instead of redirecting to /dev/null you can just use grep -q
    – phuclv
    yesterday










  • @phuclv, Thanks for the tip :-)
    – sudodus
    yesterday






  • 1




    @phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of searching through the whole file).
    – sudodus
    18 hours ago






























up vote
6
down vote














  1. Put all the .pdf files in one folder.

  2. In a terminal, change directory to that folder with cd <path to dir>

  3. Make one more directory for the non-scanned files and move every PDF that contains extractable text into it. Example:


mkdir -p ./x
for file in *.pdf; do
    # "-" sends the extracted text to stdout instead of creating a .txt file
    if [ -n "$(pdftotext "$file" - 2>/dev/null | tr -d '[:space:]')" ]; then
        mv "$file" ./x
    fi
done


All the scanned PDF files (those without any extractable text) will remain in the folder, and the other files will be moved to ./x.
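Since scanned PDFs that went through OCR also yield some text (as noted in the question), the amount of extracted text per page may be a more useful signal than whether any text exists at all. A rough sketch, assuming pdftotext and pdfinfo from poppler-utils are installed; the helper names are made up for illustration, and any cut-off value would have to be calibrated on your own files:

```shell
# Hypothetical helper: integer number of characters per page.
# $1 = total characters, $2 = page count
chars_per_page() {
    if [ "$2" -gt 0 ]; then
        echo $(( $1 / $2 ))
    else
        echo 0
    fi
}

# Hypothetical helper: extract the text with pdftotext, read the page count
# from pdfinfo, and print the characters per page for one PDF.
pdf_chars_per_page() {
    chars=$(pdftotext "$1" - 2>/dev/null | wc -c | tr -d ' ')
    pages=$(pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}')
    chars_per_page "${chars:-0}" "${pages:-0}"
}

# Print "chars-per-page <TAB> file" for every PDF in the current directory
for f in *.pdf; do
    [ -e "$f" ] || continue   # skip the literal '*.pdf' when nothing matches
    printf '%s\t%s\n' "$(pdf_chars_per_page "$f")" "$f"
done
```

Sorting that output numerically (for example with sort -n) should put the most likely scans at the top.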














edited yesterday









dessert











answered yesterday









Hobbyist













  • this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
    – DanielTheRocketMan
    yesterday






  • 8




    Scanned PDFs often contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
    – Joey
    yesterday






  • 2




    Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
    – jamesqf
    yesterday






  • 1




    @DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output of file pdf-filename.pdf includes a version number. I was unable to search for specific text in BR-L1411-3.pdf (reported as "BR-L1411-3.pdf: PDF document, version 1.3"), but was able to search for text in both of the other files you provided, which are versions 1.5 and 1.6, and got one or more matches. I used PDF XChange viewer to search these files and had similar results with evince: the version 1.3 document matched nothing.
    – Elder Geek
    yesterday






  • 1




    @DanielTheRocketMan If that's the case, you might find sorting the documents by version, using the output of file, helpful in completing your project. Although I, like others it seems, am still unclear on exactly what you are attempting to accomplish.
    – Elder Geek
    yesterday
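Joey's suggestion of looking for one large image per page could be sketched with pdfimages -list (from poppler-utils). This is only an illustration under assumptions: the helper name is made up, the 1000-pixel threshold is arbitrary, and the parsing assumes the usual layout of two header lines followed by rows whose first five columns are page, num, type, width and height.

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print the
# number of distinct pages that carry at least one image whose width and
# height both reach $1 pixels (the threshold is an assumption).
pages_with_large_image() {
    awk -v min="$1" '
        NR > 2 && $3 == "image" && $4 >= min && $5 >= min { seen[$1] = 1 }
        END { n = 0; for (p in seen) n++; print n }
    '
}

# Usage, comparing against the page count reported by pdfinfo:
#   pages=$(pdfinfo file.pdf | awk '/^Pages:/ {print $2}')
#   covered=$(pdfimages -list file.pdf | pages_with_large_image 1000)
#   [ "$covered" -eq "$pages" ] && echo "every page holds a large image"
```

A file where every page holds a page-sized image is a plausible scan whether or not an OCR text layer is present, although image-heavy slide decks may match as well.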
































up vote
1
down vote













Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe, either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.



I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
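As a starting point for the metadata idea, the sketch below reads the Creator field from plain pdfinfo output (without -meta) and flags values that hint at scanner software. The helper names and the list of patterns are assumptions for illustration only ("canon" appears in the sample metadata shown in another answer), and, as said above, the creator field is not a reliable signal on its own.

```shell
# Hypothetical helper: read pdfinfo output on stdin and print the value of
# one field, e.g. "Creator" or "Producer" (prints nothing if it is absent).
pdfinfo_field() {
    awk -v key="$1" -F': *' '$1 == key { sub(/^[^:]*: */, ""); print; exit }'
}

# Hypothetical heuristic: does a creator string hint at scanner software?
# The patterns are assumptions and will need extending for other devices.
looks_like_scanner() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        *canon*|*scan*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the PDFs in the current directory whose creator looks like a scanner
for f in *.pdf; do
    [ -e "$f" ] || continue   # skip the literal '*.pdf' when nothing matches
    creator=$(pdfinfo "$f" 2>/dev/null | pdfinfo_field Creator)
    if looks_like_scanner "$creator"; then
        echo "possibly scanned: $f"
    fi
done
```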















answered yesterday









ichabod









  • 2




    I am also trying to use Python, but it is not trivial to know whether a PDF is scanned or not. The point is that even documents in which you cannot select text yield some text when converted to txt. For instance, I am using PDFMiner in Python and I can find some text in the conversion even for PDFs where the select tool does not work.
    – DanielTheRocketMan
    yesterday
























up vote
1
down vote













If the goal is to detect whether a PDF was created by scanning, rather than whether it merely contains images instead of text, you may need to dig into the metadata of the file, not just its content.



In general, for the files I could find on my computer and your test files, the following holds:




  • Scanned files have fewer than 1000 chars/page, while non-scanned ones always have more than 1000 chars/page

  • Multiple independent scanned files had "Canon" listed as the PDF creator, probably referring to Canon scanner software

  • PDFs with "Microsoft Word" as creator are unlikely to be scanned, as they are Word exports. But someone could scan to Word and then export to PDF; some people have very strange workflows.


I'm using Windows at the moment, so I used Node.js for the following example:



const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");

// Command-line flags, see the usage notes
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;

// Debug output goes to stderr, so stdout only ever contains file paths
const debug = DEBUG ? console.error : () => { };

const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity of return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;

function evalScanned(pdfdata, textLen, textPerPage) {
    if (textPerPage < 300 && pdfdata.numpages > 1) {
        // really low number, definitely not a text PDF
        return IS_SCANNED;
    }
    if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
        // Canon is a scanner brand name, so treat this as scanned
        return IS_SCANNED;
    }
    // Enough text per page: it might still be scanned and OCRed, but with
    // no other suspicion of scanning we assume that it is not
    return textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
}

(async () => {
    const pdfs = (await fs.readdir(".")).filter((fname) => fname.endsWith(".pdf"));

    for (const pdffilename of pdfs) {
        try {
            debug("\n\nFILE: ", pdffilename);
            const buffer = await fs.readFile(pdffilename);
            const data = await pdf_parse(buffer);

            // Guard against PDFs without info or metadata
            if (!data.info) {
                data.info = {};
            }
            if (!data.metadata) {
                data.metadata = { _metadata: {} };
            }

            // PDF info
            debug(data.info);
            // PDF metadata
            debug(data.metadata);
            // text length
            const textLen = data.text ? data.text.length : 0;
            const textPerPage = textLen / data.numpages;
            debug("Text length: ", textLen);
            debug("Chars per page: ", textPerPage);
            // PDF.js version
            // check https://mozilla.github.io/pdf.js/getting_started/
            debug(data.version);

            if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
                console.log(path.resolve(".", pdffilename));
            }
        }
        catch (e) {
            if (STRICT && !DEBUG) {
                console.error("Failed to evaluate " + pdffilename);
            }
            else {
                debug("Failed to evaluate " + pdffilename);
                debug(e.stack);
            }
            if (STRICT) {
                process.exit(1);
            }
        }
    }
})();


To run it, you need Node.js installed (installing it should be a single command), and you also need to install the two dependencies:



npm install mz pdf-parse


Usage:



node howYouNamedIt.js [scanned] [debug] [strict]

- scanned shows PDFs thought to be scanned (otherwise shows those thought not to be scanned)
- debug shows the debug info such as metadata and error stack traces
- strict kills the program on first error


This example is not a finished solution, but with the debug flag you get some insight into the meta information of a file:



FILE:  BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
  IsAcroFormPresent: false,
  IsXFAPresent: false,
  Creator: 'Canon ',
  Producer: ' ',
  CreationDate: 'D:20131212150500-03\'00\'',
  ModDate: 'D:20140709104225-03\'00\'' }
Metadata {
  _metadata:
   { 'xmp:createdate': '2013-12-12T15:05-03:00',
     'xmp:creatortool': 'Canon',
     'xmp:modifydate': '2014-07-09T10:42:25-03:00',
     'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
     'pdf:producer': '',
     'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
     'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
     'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:\web\so-odpovedi\pdf\BR-L1411-3-scanned.pdf


The naive function that I wrote correctly classified all the documents that I could find on my computer (including your samples). I named each file after its known status before running the program, so that you can see whether the results are correct.



D:\xxxx\pdf>node detect_scanned.js scanned
D:\xxxx\pdf\AR-G1002-scanned.pdf
D:\xxxx\pdf\AR-G1002_scanned.pdf
D:\xxxx\pdf\BR-L1411-3-scanned.pdf
D:\xxxx\pdf\WHO_TRS_696-scanned.pdf

D:\xxxx\pdf>node detect_scanned.js
D:\xxxx\pdf\AR-G1003-not-scanned.pdf
D:\xxxx\pdf\ASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:\xxxx\pdf\MULTIMODE ABSORBER-not-scanned.pdf
D:\xxxx\pdf\ReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf


You can use the debug mode along with a tiny bit of programming to vastly improve your results. The output can also be passed to other programs, since it always consists of one full path per line.
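Because the program prints one full path per line, a small shell function is enough to act on the result. The sketch below assumes a POSIX shell (on Windows, something like Git Bash or WSL) and that the script was saved as detect_scanned.js; move_listed_files is a hypothetical helper name.

```shell
# Hypothetical helper: read one path per line on stdin and move each existing
# file into the directory given as $1 (created if needed).
move_listed_files() {
    mkdir -p "$1" || return 1
    # IFS= and -r keep spaces and backslashes in file names intact
    while IFS= read -r pdf_path; do
        [ -f "$pdf_path" ] && mv -- "$pdf_path" "$1"/
    done
    return 0
}

# Usage:
#   node detect_scanned.js scanned | move_listed_files scanned
```

Reading line by line rather than word by word matters here, since names such as "MULTIMODE ABSORBER-not-scanned.pdf" contain spaces.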



























  • Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
    – jamesqf
    13 hours ago




























up vote
1
down vote










up vote
1
down vote









If this is more about actually detecting if PDF was created by scanning rather than pdf has images instead of text then you might need to dig into the metadata of the file, not just content.



In general, for the files I could find on my computer and your test files, the following is true:




  • Scanned files have fewer than 1000 chars/page, whereas non-scanned ones always have more than 1000 chars/page.

  • Multiple independent scanned files had "Canon" listed as the PDF creator, probably referring to Canon scanner software.

  • PDFs with "Microsoft Word" as creator are likely not scanned, as they are Word exports. But someone could scan to Word and then export to PDF; some people have very strange workflows.


I'm using Windows at the moment, so I used Node.js for the following example:



const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");

// command-line flags
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;

// debug output goes to stderr, so stdout only ever contains file paths
const debug = DEBUG ? console.error : () => { };

const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity of return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;

(async () => {
    // only the PDFs in the current directory are examined
    const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });

    for (let i = 0, l = pdfs.length; i < l; ++i) {
        const pdffilename = pdfs[i];
        try {
            debug("\n\nFILE: ", pdffilename);
            const buffer = await fs.readFile(pdffilename);
            const data = await pdf_parse(buffer);

            // make sure the fields used below always exist
            if (!data.info)
                data.info = {};
            if (!data.metadata) {
                data.metadata = {
                    _metadata: {}
                };
            }

            // PDF info
            debug(data.info);
            // PDF metadata
            debug(data.metadata);
            // text length
            const textLen = data.text ? data.text.length : 0;
            const textPerPage = textLen / (data.numpages);
            debug("Text length: ", textLen);
            debug("Chars per page: ", textPerPage);
            // PDF.js version
            // check https://mozilla.github.io/pdf.js/getting_started/
            debug(data.version);

            if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
                console.log(path.resolve(".", pdffilename));
            }
        }
        catch (e) {
            if (STRICT && !DEBUG) {
                console.error("Failed to evaluate " + pdffilename);
            }
            else {
                debug("Failed to evaluate " + pdffilename);
                debug(e.stack);
            }
            if (STRICT) {
                process.exit(1);
            }
        }
    }
})();

function evalScanned(pdfdata, textLen, textPerPage) {
    if (textPerPage < 300 && pdfdata.numpages > 1) {
        // really low number, definitely not a text PDF
        return IS_SCANNED;
    }
    // definitely has enough text
    // might be scanned but OCRed
    // we return this if no
    // suspicion of scanning is found
    let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
    if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
        // this is always scanned, Canon is a brand name
        return IS_SCANNED;
    }
    return implicitAssumption;
}


To run the script, you need Node.js installed (installing it should take a single command), and you also need to install the two dependencies:



npm install mz pdf-parse


Usage:



node howYouNamedIt.js [scanned] [debug] [strict]

- scanned shows PDFs thought to be scanned (otherwise shows those thought not to be scanned)
- debug shows debug info such as metadata and error stack traces
- strict kills the program on the first error


This example is not meant as a finished solution, but with the debug flag you get some insight into the meta information of a file:



FILE:  BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03'00'',
ModDate: 'D:20140709104225-03'00'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:webso-odpovedipdfBR-L1411-3-scanned.pdf


The naive function that I wrote had a 100% success rate on the documents that I could find on my computer (including your samples). I named the files according to their known status before running the program, so that it is possible to see whether the results are correct.



D:\xxxx\pdf>node detect_scanned.js scanned
D:\xxxx\pdf\AR-G1002-scanned.pdf
D:\xxxx\pdf\AR-G1002_scanned.pdf
D:\xxxx\pdf\BR-L1411-3-scanned.pdf
D:\xxxx\pdf\WHO_TRS_696-scanned.pdf

D:\xxxx\pdf>node detect_scanned.js
D:\xxxx\pdf\AR-G1003-not-scanned.pdf
D:\xxxx\pdf\ASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:\xxxx\pdf\MULTIMODE ABSORBER-not-scanned.pdf
D:\xxxx\pdf\ReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf
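
Because the sample files carry their known status in their names, the two listings above can also be checked mechanically instead of by eye. The following is only a rough sketch: it assumes the naming convention seen in the listings (a name ending in `not-scanned.pdf` is a text PDF, any other name ending in `scanned.pdf` is a scan), and the helper name `findMislabeled` is made up for this illustration.

```javascript
// Hypothetical helper, not part of the script above.
// `output` is what the program printed (one full path per line) and
// `reportedAsScanned` says whether it was run with the "scanned" flag.
// Returns the paths whose file name disagrees with the program's verdict.
function findMislabeled(output, reportedAsScanned) {
    return output
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0)
        .filter((line) => {
            // split on both separators so Windows paths work on any platform
            const name = line.split(/[\\/]/).pop().toLowerCase();
            const notScanned = /not[-_]scanned\.pdf$/.test(name);
            const scanned = !notScanned && /scanned\.pdf$/.test(name);
            if (!scanned && !notScanned) {
                return false; // file carries no label, nothing to compare
            }
            return scanned !== reportedAsScanned;
        });
}
```

An empty result for both runs means every labeled file ended up in the listing its name predicts.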


You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can also pass the output of the program to other programs; it always prints one full path per line.
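
As one example of such a refinement, the creator string shown in the debug output can be consulted alongside the text count. The sketch below is an illustration only and is not part of the script above: the function name `refineWithCreator` and the extra rule for Word are assumptions of mine, and the thresholds are simply the ones used in `evalScanned`.

```javascript
const CREATOR_CANON = /canon/i;
const CREATOR_MS_WORD = /microsoft.*?word/i;

// Hypothetical variant of evalScanned: returns true when the PDF is thought
// to be scanned. `info` is the info dictionary printed in debug mode.
function refineWithCreator(info, textPerPage, numPages) {
    const creator = (info && info.Creator) || "";
    if (CREATOR_CANON.test(creator)) {
        return true;              // scanner software wrote the file
    }
    if (textPerPage < 300 && numPages > 1) {
        return true;              // almost no extractable text
    }
    if (CREATOR_MS_WORD.test(creator)) {
        return false;             // assumed to be a direct Word export
    }
    return textPerPage <= 1000;   // same fallback threshold as evalScanned
}
```

Other creators seen in your own debug output can be added in the same way, but each such rule is a guess about a workflow and should be checked against files whose status is known.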



































answered yesterday









Tomáš Zato













  • Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
    – jamesqf
    13 hours ago




























up vote
0
down vote













Two ways I can think of:




  1. Using the text selection tool: in a scanned PDF the text cannot be selected; a selection box appears instead. You can use this fact to create the script. I know there is a way to do this in C++ with Qt, though I am not sure about Linux.


  2. Searching for a word in the file: in a non-scanned PDF a search will work, but in a scanned file it will not. You just need to find some words common to all the PDFs, or better, search for the letter 'e' in all of them. It is the most frequent letter in English, so chances are you will find it in every document that has text (unless it's Gadsby).



e.g.



grep -rn '/path/to/pdf/' -e 'e'


Any text-processing tool will do. Note that grep reads the raw bytes of the file, and the text in a PDF is usually stored in compressed streams, so the search is more reliable on text that has been extracted first (for example with pdftotext).
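
The second idea can be sketched in Node.js, the language used in another answer on this page. This is only an illustration under stated assumptions: the text is assumed to have been extracted from the PDF already (for instance the `text` field that `pdf-parse` returns), and the function name `looksSearchable` and its 5% threshold are made up for this sketch.

```javascript
// Hypothetical check: does the extracted text look like real prose?
// Measures how often the letter "e" occurs among all ASCII letters.
function looksSearchable(text, minShareOfE = 0.05) {
    const letters = (text || "").match(/[a-z]/gi) || [];
    if (letters.length === 0) {
        return false;             // nothing extractable: likely a bare scan
    }
    const eCount = letters.filter((ch) => ch === "e" || ch === "E").length;
    return eCount / letters.length >= minShareOfE;
}
```

Using a share rather than a single hit makes the check a little more robust against stray characters left by a poor OCR pass. A text like Gadsby would still be misclassified, and documents in languages where "e" is rare would need a different letter or threshold.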

























  • 1
    A scanned PDF can also have selectable text, because OCR is nothing unusual nowadays and even many free PDF readers have an OCR feature.
    – phuclv
    yesterday










  • @phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
    – jamesqf
    yesterday






  • 1
    @jamesqf Please look at the examples above. They are scanned PDFs, and I cannot retrieve most of their text using a conventional pdfminer.
    – DanielTheRocketMan
    yesterday






  • 1
    I think the OP needs to rethink or rephrase the definition of "scanned" in that case, or stop using Acrobat X, which takes a scanned copy and treats it as OCR text rather than as an image.
    – swapedoc
    yesterday























edited yesterday









phuclv











answered yesterday









swapedoc



























 







