Recently, our team was working on remediation for a client, and two of their PDFs were part of the deliverables. After opening the PDFs and working through the Accessibility tool in Adobe Acrobat, it became clear that something was amiss. It turns out that the images in the PDF came from drafting software, such as CAD®, which rendered them in multiple layers and caused accessibility errors. I’ll get to that tip in a bit, but first a brief history of the PDF and why it is important to have an accessible document that is usable by everyone.
Breakdown of a Portable Document Format (PDF)
The Portable Document Format (PDF) was developed by Adobe Systems and introduced in January 1993 at the Windows and OS/2 Conference. Its purpose was to capture documents from any application and send electronic versions of them anywhere, much as we do today. Back then, it allowed a document to be viewed (and printed) on any machine. Today, it is the most recognizable and widely used file format for businesses worldwide. But it is not the most accessible format.
Many page description languages (PDLs) exist today; however, Printer Command Language (PCL) and PostScript are the most common and widely adopted. A PDL is a high-level computer language that describes the appearance and layout of a printed page, and a PDF relies on the same imaging model. It describes the text and graphics in a device-independent and resolution-independent manner. Basically, the layout of a page to be displayed or printed, and the text or graphics oriented on that page, is independent of the native file that created it.
Then there is Optical Character Recognition (OCR), the process that converts an image of text into a machine-readable text format. For example, if you scan a form or a receipt, your computer saves the scan as an image file. You cannot use a text editor to edit, search, or count the words in that image file until OCR has converted it to text.
Why PDFs are NOT easily Accessible
For a document or application to be considered accessible, it needs to meet certain technical criteria so that it can be used by people with disabilities. This includes access by people who are blind, have low vision, are deaf or hard of hearing, are mobility impaired, have a situational disability, or have cognitive impairments. The guidelines for this are the Web Content Accessibility Guidelines (WCAG) 2.1 (with WCAG 2.2 expected in December 2022). These guidelines cover a wide range of recommendations for making content more accessible to people with disabilities. One benefit of following them is that content becomes more usable for all users.
With accessibility features built into Adobe Acrobat (paid version) and Adobe Reader (free version), the Portable Document Format (PDF) should make it easier for people with disabilities to use PDF documents and forms, yet for the most part it does not. According to Adobe, “… PDF should be usable without the aid of assistive technology software and devices such as screen readers, screen magnifiers, text-to-speech software, speech recognition software, alternative input devices, Braille embossers, and refreshable Braille displays.”
PDF ‘FIELDS’ – Fix #1
So, if the PDF is broken down into text and images, why does it get jumbled up where accessibility is concerned? The tagged tree, or ‘FIELDS’ (seen in the image to the right), needs to be in the correct order and structure. In this PDF example (using Adobe Acrobat), the order of the fields is just one of the culprits, and it needs to be corrected first.
You can create a form field by choosing one of the Adobe form tools. For this I used ‘Prepare Form’; after it runs, the FIELDS populate a hierarchical, tagged tree window. Each field corresponds to a location on the PDF. You can set various options on the form fields, and you can also drag and drop them into the correct part of the FIELDS tree or hierarchy. It is the structure of those fields that attempts to mimic the document flow. However, I have discovered that many accessibility and screen-reader issues hide in fields that are not in the correct order as originally designed in the native file or program.
If you select a field like the one seen at right, shown in light blue as a T (text box) and labeled ‘undefined’ in this example, Acrobat will highlight the corresponding area in the PDF. What is wrong here is that the T is not a text box at all but an image on the page that the page description language (PDL) read incorrectly and labeled as an ‘undefined’ text box anyway.
First, work from top to bottom through the FIELDS and check that each field flows correctly and does not jump around the page. If you select, say, ‘Page 1’, it will jump to that page and you can work from there. If it is a PDF form for a business, you will need to take your time and check that each field is in the correct reading order.
A logical tagged structure tree is used within each document to provide a meaningful reading order for content, as well as a method for defining each structural element’s role and relationship to page content.
Once you have successfully gone through the PDF fields and they run left to right and top to bottom, the screen reader and other assistive technology will read them correctly.
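The top-to-bottom, left-to-right check described above is done by hand in Acrobat, but the idea behind it can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Field` record, its coordinates (measured in points from the top-left corner of the page), and the `row_tolerance` band are all hypothetical and are not part of Acrobat or any PDF library. The sketch walks the fields in their current tree order and reports every place where a field comes before one that should be read earlier.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical record for one entry in the FIELDS tree."""
    name: str
    page: int
    x: float  # distance from the left edge of the page, in points
    y: float  # distance from the top edge of the page, in points


def reading_order_key(field, row_tolerance=10.0):
    # Fields whose y values fall in the same band count as one row,
    # so small vertical offsets do not break left-to-right ordering.
    row = int(field.y // row_tolerance)
    return (field.page, row, field.x)


def find_order_breaks(fields, row_tolerance=10.0):
    """Return (previous, current) name pairs where the tree order jumps backwards."""
    breaks = []
    for prev, cur in zip(fields, fields[1:]):
        if reading_order_key(cur, row_tolerance) < reading_order_key(prev, row_tolerance):
            breaks.append((prev.name, cur.name))
    return breaks


# Example tree order: "Signature" sits above "Email" in the tree,
# even though "Email" appears higher up on the page.
fields = [
    Field("First name", 1, 72, 100),
    Field("Last name", 1, 300, 102),
    Field("Signature", 1, 72, 600),
    Field("Email", 1, 72, 150),
]

print(find_order_breaks(fields))
print([f.name for f in sorted(fields, key=reading_order_key)])
```

Each reported pair marks a spot where you would drag a field to a different position in the tree. Sorting by the same key gives one plausible corrected order, though a human still has to confirm that it matches the intended flow of the form.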
PDF ‘undefined’ – Fix #2
As mentioned above, the item that the PDF treats as a text field is actually an image, and it is labeled ‘undefined’. This undefined field (in this example) comes from the layered drafting software used to create the main image on that PDF page. The layers in the original drafting image have affected how the field was defined in the PDF.
Many of the ‘undefined’ labels need to be deleted. Don’t worry: if you want, create a duplicate PDF and try it there. You will see that deleting the field does not affect the image; the field is only the PDF’s attempt to decipher it. The ‘undefined’ text box is usually just a floating text box that has no real place in the PDF. It was added because the PDL created incorrect auto tags and read the image wrong, so go ahead and delete it.
If the fields in the PDF are not tagged correctly, the reading order is jumbled. If it doesn’t flow from left to right and top to bottom, it will confuse the screen reader, and the PDF is rendered inaccessible.
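If a form has dozens of fields, it can help to list the ‘undefined’ ones before deleting anything. The sketch below assumes the field names have already been copied out of the FIELDS tree into a plain list of dictionaries; that list, the `name` and `type` keys, and the rule that an auto-generated label starts with "undefined" are assumptions made for illustration, not an Acrobat export format. It only separates the fields into a keep list and a review list; the deletion itself still happens in Acrobat, ideally on a duplicate of the PDF.

```python
def split_undefined(fields):
    """Separate fields with an auto-generated 'undefined' label from the rest.

    `fields` is a hypothetical list of dicts with 'name' and 'type' keys.
    """
    keep, review = [], []
    for field in fields:
        label = (field.get("name") or "").strip().lower()
        if label.startswith("undefined"):
            review.append(field)  # candidate for deletion after a visual check
        else:
            keep.append(field)
    return keep, review


fields = [
    {"name": "First name", "type": "text"},
    {"name": "undefined", "type": "text"},
    {"name": "Undefined_2", "type": "text"},
    {"name": "Signature", "type": "signature"},
]

keep, review = split_undefined(fields)
print("Keep:  ", [f["name"] for f in keep])
print("Review:", [f["name"] for f in review])
```

Treating the flagged fields as "review" rather than "delete" is deliberate: a field might carry the label by accident and still be a real form control.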
PDF images and scanned PDFs – Fix #3
Scanned PDFs are bad. When someone just scans a book or a piece of paper, the resulting images are hard, or even impossible, for the imaging model or PDL to decipher. Those scanned images are not accessible because they are just an image of text. You have to run OCR (Optical Character Recognition), which tries to figure out the words in the image. Bad scans = bad OCR.
Images need to be defined as well. Using a .png or .jpeg image is preferable. The optimal resolution for a PDF image that is meant for computer screen display (not for a commercial printer) is 100 dpi. High-resolution images can be kept for other documents; they are not needed here. The lower resolution reduces the file size of the PDF, so it uploads faster as a smaller document. This is best for online performance and even better for mobile formats: faster loading and an easy flow through the document mean better UX (user experience).
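To see why the lower resolution matters, here is a small arithmetic sketch. The 6 x 4 inch image and the 300 dpi starting point are made-up example values, and the function names are hypothetical. Pixel count grows with the square of the resolution, so going from 300 dpi to 100 dpi leaves one ninth of the pixels. The actual file size also depends on the image format and compression, so treat the ratio as a rough guide rather than an exact saving.

```python
def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of an image placed at a given physical size and resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_ratio(source_dpi, target_dpi):
    """How many times more pixels the source has than the downsampled target."""
    return (source_dpi / target_dpi) ** 2


# Example values only: a 6 x 4 inch image, print resolution vs. screen resolution.
print_w, print_h = pixel_size(6, 4, 300)
screen_w, screen_h = pixel_size(6, 4, 100)

print(f"300 dpi: {print_w} x {print_h} = {print_w * print_h:,} pixels")
print(f"100 dpi: {screen_w} x {screen_h} = {screen_w * screen_h:,} pixels")
print(f"Pixel reduction factor: {pixel_ratio(300, 100):.0f}x")
```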
So, once you have gone through the tree structure to confirm the fields are tagged correctly and alt text is added, you’re on the home stretch! Within this tag structure, other properties such as alternative text (alt text) and replacement text can be provided so that an image is understandable and can be read out by assistive technology. As mentioned before, an accessible PDF is a better performing document overall.
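The alt-text pass can be pictured as a walk over the tag tree that stops at every figure. The sketch below uses a nested dictionary as a stand-in for that tree; the dictionary layout and the `alt` and `children` keys are assumptions for illustration only, although `Figure`, `Sect`, `H1`, and `Document` are standard PDF structure type names. It returns the path to each figure whose alternative text is missing or blank.

```python
def figures_missing_alt(tag, path=""):
    """Return tree paths of Figure tags that have no alternative text.

    `tag` is a hypothetical nested dict: {"type": ..., "alt": ..., "children": [...]}.
    """
    here = f"{path}/{tag['type']}"
    missing = []
    if tag["type"] == "Figure" and not (tag.get("alt") or "").strip():
        missing.append(here)
    for child in tag.get("children", []):
        missing.extend(figures_missing_alt(child, here))
    return missing


# Made-up tag tree: one figure has alt text, one nested figure does not.
tree = {
    "type": "Document",
    "children": [
        {"type": "H1"},
        {"type": "Figure", "alt": "Floor plan of level 2"},
        {"type": "Sect", "children": [{"type": "Figure", "alt": ""}]},
    ],
}

print(figures_missing_alt(tree))
```

A check like this only confirms that alt text exists. Whether the text actually describes the image well is still a judgment call for the person doing the remediation.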
For now, this will help address some major accessibility issues for PDFs. If you need further assistance, you can always reach out and ask us. At Accessiblü we love to talk about accessibility!