I have two PDF files which look the same and I want to extract data with this function:
import pandas as pd
import pdfplumber

def extract_pdf_table(pdf_path):  # function name assumed; the def line is not shown in the post
    all_data = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table()
            if table:
                filtered_table = table[5:]  # Skip the rows above the header
                # Clean the header (clean_text is a helper defined elsewhere)
                header = [clean_text(h) for h in filtered_table[0]]
                data = filtered_table[1:]
                # Remove empty rows from the table
                data = [row for row in data if any(cell and cell.strip() for cell in row)]
                # Append this page's data to the list
                all_data.extend(data)
    if not all_data:
        return
    df = pd.DataFrame(all_data, columns=header)
    df.dropna(how='all', inplace=True)
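One detail of the loop above is that `header` is overwritten on every page, so the DataFrame ends up with the header of the last page that had a table. A minimal sketch of taking the header once and setting aside rows whose width does not match it, assuming every page repeats the same preamble and header rows (the function name `merge_page_tables` and the `odd_rows` list are made up for this example):

```python
def merge_page_tables(tables, skip_rows=5):
    """Combine per-page tables (lists of rows) into one header and one row list.

    Returns (header, rows, odd_rows), where odd_rows holds non-empty rows
    whose length differs from the header and therefore need a closer look.
    """
    header = None
    rows = []
    odd_rows = []
    for table in tables:
        if not table or len(table) <= skip_rows:
            continue  # no table on this page, or nothing after the preamble
        body = table[skip_rows:]
        if header is None:
            header = body[0]  # keep the header from the first usable page
        for row in body[1:]:
            if not any(cell and cell.strip() for cell in row):
                continue  # skip empty rows
            if len(row) != len(header):
                odd_rows.append(row)  # width mismatch: report instead of mixing in
            else:
                rows.append(row)
    return header, rows, odd_rows
```

It would be fed with `[page.extract_table() for page in pdf.pages]`. If `odd_rows` is non-empty for only one of the two files, that points at a layout difference between the pages.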
This code works for only one of the two PDFs. I opened both files in Visual Studio Code and their first lines are not the same.
The one that works:
%PDF-1.7
%����
7 0 obj
<<
/Type /XObject
/Subtype /Image
/Width 242
/Height 43
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Interpolate false
/Filter /FlateDecode
/Length 8514
>>
stream
The one that doesn’t work:
%PDF-1.7
%����
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(en) /StructTreeRoot 61 0 R/MarkInfo<</Marked true>>/Metadata 590 0 R/ViewerPreferences 591 0 R>>
endobj
2 0 obj
<</Type/Pages/Count 12/Kids[ 4 0 R 25 0 R 30 0 R 33 0 R 37 0 R 40 0 R 43 0 R 46 0 R 49 0 R 52 0 R 55 0 R 58 0 R] >>
endobj
3 0 obj
<</Title(��Bando 2 CU Allegato B - Piano delle installazioni dettagliato) /Author(GSE) /CreationDate(D:20240904152507+00'00') /ModDate(D:20240904152507+00'00') /Producer() /Creator() >>
endobj
4 0 obj
<</Type/Page/Parent 2 0 R/Resources<</XObject<</Image6 6 0 R/Image12 12 0 R/Image15 15 0 R>>/ExtGState<</GS7 7 0 R/GS10 10 0 R>>/Font<</F1 8 0 R/F2 16 0 R/F3 18 0 R/F4 20 0 R>>/Pattern<</P11 11 0 R/P13 13 0 R/P14 14 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 841.92 595.32] /Contents 5 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 4379>>
stream
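The raw bytes mainly show that the two files were written differently (the second one is a tagged PDF with a `/StructTreeRoot`), which does not say what pdfplumber actually extracts from each. A rough sketch for comparing the two files at the table level instead; `describe_tables` and `describe_file` are names invented for this example, and only `pdfplumber.open`, `pdf.pages` and `page.extract_table()` come from the library:

```python
def describe_tables(pdf):
    """Return one (page_number, n_rows, max_columns, first_row) tuple per page."""
    summary = []
    for number, page in enumerate(pdf.pages, start=1):
        table = page.extract_table()
        if not table:
            summary.append((number, 0, 0, None))  # no table found on this page
            continue
        max_columns = max(len(row) for row in table)
        summary.append((number, len(table), max_columns, table[0]))
    return summary


def describe_file(pdf_path):
    # Imported here so describe_tables can be tried without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return describe_tables(pdf)
```

Running `describe_file` on both PDFs and comparing the output page by page should show whether the second file yields a different number of rows or columns, or a different first row.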
I have to write the data to Excel, and when I write the data from the second PDF, a lot of binary-looking content appears between the values in Excel.
I exported both PDF files from Excel. I know it’s strange, but I need it for a reason.
I need both PDFs to be read correctly.
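To narrow down where the unexpected content enters, it can help to scan the extracted rows before they go into the DataFrame. A small sketch, assuming the rows are lists of cells as returned by `page.extract_table()`; the helper name `suspicious_cells` is made up here:

```python
def suspicious_cells(rows):
    """Return (row_index, col_index, repr) for cells that are not plain text.

    A cell is flagged when it is neither None nor a str, or when it contains
    non-printable characters other than newline and tab.
    """
    found = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell is None:
                continue  # empty cells are expected
            if not isinstance(cell, str):
                found.append((r, c, repr(cell)))
            elif any(not ch.isprintable() and ch not in "\n\t" for ch in cell):
                found.append((r, c, repr(cell)))
    return found
```

An empty result on `all_data` would suggest that the extraction is clean and the unexpected content is introduced later in the pipeline.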
Update: the real problem is the mapping function. I printed what pdfplumber reads and it is correct, but when the data is written, a lot of binary-looking content ends up in the Excel file.
def apply_mapping(text):
    for key, value in data_mapping.items():
        if key in text:
            return value
    return text

and when I call it:
for col in df.columns:
    df[col] = df[col].apply(lambda x: apply_mapping(clean_text(str(x))) if x is not None else "")
that doesn’t work.
It gives me the following warning, and the script fails only when this warning appears:
FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use ser.iloc[pos]
  df[col] = df[col].apply(lambda x: apply_mapping(clean_text(str(x))) if x is not None else "")
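One possible explanation, which is only a guess for this data, is that the header contains duplicate names (for instance several empty header cells). In that case `df[col]` is a DataFrame rather than a Series, `.apply` passes whole columns to the lambda, and `str(x)` turns an entire column into one long string. A sketch that reports duplicate names and maps each column by position; `data_mapping` and `clean_text` below are stand-ins for the real ones, and `map_cell` and `map_all_columns` are names invented for this example:

```python
import pandas as pd

# Stand-ins for the helpers from the post (the real ones are not shown)
data_mapping = {"yes": "Y", "no": "N"}


def clean_text(text):
    return " ".join(text.split())


def apply_mapping(text):
    for key, value in data_mapping.items():
        if key in text:
            return value
    return text


def map_cell(x):
    # Treat None and NaN as empty cells
    if x is None or (isinstance(x, float) and pd.isna(x)):
        return ""
    return apply_mapping(clean_text(str(x)))


def map_all_columns(df):
    """Apply map_cell to every cell, selecting columns by position."""
    duplicates = df.columns[df.columns.duplicated()].tolist()
    if duplicates:
        print("Duplicate column names:", duplicates)
    # iloc always returns a single Series, even when names repeat
    mapped = {i: df.iloc[:, i].map(map_cell) for i in range(df.shape[1])}
    out = pd.DataFrame(mapped)
    out.columns = df.columns  # restore the original (possibly duplicated) names
    return out
```

If the printed list is not empty, giving the columns unique names before the mapping loop would be the next thing to try.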