Google has been indexing PDF files - a stalwart of the myriad content forms accessible on the web - since 2001, and by our very rough estimate there are over 500 million PDFs currently indexed by the search engine.
There remains quite a bit of confusion surrounding the use of PDFs in relation to ranking on search engines, but Google has shown a quick glance at its cards, clearing up at least some of that confusion today in a post about the format in search results. If you're responsible for promoting a website and driving traffic to its pages, pay close attention:
- Google can index textual content from PDF files provided they are not password protected, and it can use OCR (Optical Character Recognition) to extract text that is embedded as images. To be safe, keep the actual text as real text, separate from images and other design elements.
- Links within PDFs are treated similarly to HTML links, passing PageRank and other indexing signals. Links inside a PDF cannot be marked "nofollow". Since PDF content is crawled, it is important to follow best-practice guidance on the use of keyword-rich anchor text.
- Google recommends publishing a single copy of any piece of content (e.g. HTML or PDF, rather than both), but it has provided a way for content owners/publishers to indicate the preferred URL when duplicates exist: specify the canonical version in the HTML or in the HTTP headers of the PDF.
- There are two ways to influence the title of the PDF which appears in the search results - the metadata within the PDF and the anchor text of the links pointing to the PDF across the Web.
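To make the canonical point concrete: a PDF has no HTML head to hold a canonical tag, so the preferred URL has to travel in the HTTP response instead, as a `Link` header with `rel="canonical"`. Below is a minimal sketch in Python of what those response headers might look like. The function names and the example.com URL are our own placeholders, not anything from Google's post, and how you actually attach headers will depend on your web server or framework.

```python
from typing import List, Tuple


def canonical_link_header(preferred_url: str) -> Tuple[str, str]:
    """Build a Link header that names preferred_url as the canonical version.

    Hypothetical helper for illustration; the URL is assumed to be absolute.
    """
    if not preferred_url.startswith(("http://", "https://")):
        raise ValueError("canonical URL should be absolute")
    return ("Link", f'<{preferred_url}>; rel="canonical"')


def pdf_response_headers(pdf_bytes: bytes, preferred_url: str) -> List[Tuple[str, str]]:
    """Return the headers to send with a PDF whose preferred copy lives elsewhere."""
    return [
        ("Content-Type", "application/pdf"),
        ("Content-Length", str(len(pdf_bytes))),
        canonical_link_header(preferred_url),
    ]


if __name__ == "__main__":
    # Placeholder bytes and URL, used only to show the shape of the output.
    headers = pdf_response_headers(
        b"%PDF-1.4 ...", "https://www.example.com/white-paper.html"
    )
    for name, value in headers:
        print(f"{name}: {value}")
```

Served this way, the PDF itself tells crawlers that the HTML page is the version you would prefer to see in search results.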