
How to scrape PDF from CNKI #116

Open
lullabymia opened this issue Nov 30, 2018 · 14 comments
Labels
discussion Extra attention is needed

Comments

@lullabymia

lullabymia commented Nov 30, 2018

Troubleshooting

Describe your environment

  • Operating system: macOS
  • Python version: 3.7
  • Hardware: MacBook
  • Internet access:
  • Jupyter notebook or not? [Y/N]: Y
  • Which chapter of book?: chapter 6

Describe your question

I want to scrape the full PDF text of People's Daily from CNKI but have no idea how to do it. Do I need to download all the articles first?


Describe the efforts you have spent on this issue

I found this article about PDF scraping:
https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
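
For reference, the core of that article's approach is a single call to textract (a minimal sketch, assuming textract and its system dependencies install cleanly; sample.pdf is a placeholder filename):

import textract

# textract.process returns the extracted text as bytes
text = textract.process('sample.pdf')
print(text.decode('utf-8'))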

@hupili
Owner

hupili commented Nov 30, 2018

@lullabymia , does the Medium article work? It seems to include a complete example of extracting text from PDFs.

@lullabymia
Author

lullabymia commented Dec 1, 2018

I followed the instructions in the article, but I could not install textract at the first step.

@hupili

@ChicoXYC
Collaborator

ChicoXYC commented Dec 1, 2018

@lullabymia How many files do you have? Maybe you can send them to me and I will help you run OCR with tools like filereader OCR.

@lullabymia
Author

We might need to scrape thousands of PDF files from this website http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB (at least from Jan 1, 2008 to June 30, 2008).
Do I need to download all the PDFs before scraping?
@ChicoXYC

@ChicoXYC
Collaborator

ChicoXYC commented Dec 1, 2018

@lullabymia Yes, you first need to get the PDFs from the website. The following is an example of extracting words from a PDF:
https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/extract-words-with-pdf.ipynb
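
The gist of the linked notebook, as a minimal sketch (assuming PyPDF2 is installed and sample.pdf is a local file):

import PyPDF2

# open the PDF in binary mode and print the text of each page
with open('sample.pdf', 'rb') as f:
    pdfReader = PyPDF2.PdfFileReader(f)
    for i in range(pdfReader.numPages):
        page = pdfReader.getPage(i)
        print(page.extractText())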

@ChicoXYC
Collaborator

ChicoXYC commented Dec 1, 2018

I also tried the method mentioned in the Medium article above, but cannot get past installing the module for now. Will try again later. https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/failure-extract-pdf-textract.ipynb

@hupili
Owner

hupili commented Dec 3, 2018

Don't dwell on textract. PyPDF2 alone already works, as shown in this comment.

@lullabymia
Author

lullabymia commented Dec 10, 2018

But there is an error while reading the PDFs:
Multiple definitions in dictionary at byte 0x7eb1 for key /MediaBox
@ChicoXYC
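
If that message comes from PyPDF2's strict parsing mode, one thing worth trying is to open the reader with strict=False, which downgrades such malformed-dictionary problems to warnings instead of errors (a sketch, not verified against these particular files; sample.pdf is a placeholder):

import PyPDF2

with open('sample.pdf', 'rb') as f:
    # strict=False makes PyPDF2 warn instead of failing on duplicate dictionary keys
    pdfReader = PyPDF2.PdfFileReader(f, strict=False)
    for i in range(pdfReader.numPages):
        print(pdfReader.getPage(i).extractText())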

@lullabymia
Author

Besides the above reading error, I cannot click the "download PDF" link on the webpage
(and it seems that when I paste the link directly, it downloads a .caj file instead of a PDF):
http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!

import selenium
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)
html=browser.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
links = soup.find('div',attrs={'class':"dllink"})
link = links.find('a',attrs={'class':"icon icon-dlpdf"})
link.click()

@ChicoXYC Can you also help me with that?

@ChicoXYC
Collaborator

@lullabymia You need to locate the element with selenium itself rather than with BeautifulSoup, so that the returned element can actually be clicked. The following should help:

import selenium
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)

# find the "download PDF" link as a live element in the browser, then click it
link = browser.find_element_by_css_selector('.dllink a.icon.icon-dlpdf')
link.click()
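
If Chrome opens the file in its built-in viewer or saves it somewhere unexpected instead of downloading the PDF, it may help to start the driver with download preferences set. This is a sketch under that assumption; the download directory path is a placeholder:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    # save downloads to this folder without prompting
    'download.default_directory': '/path/to/pdfs',
    'download.prompt_for_download': False,
    # download PDFs instead of opening them in Chrome's viewer
    'plugins.always_open_pdf_externally': True,
})
browser = webdriver.Chrome(chrome_options=options)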

@ChicoXYC
Collaborator

@lullabymia

import PyPDF2
import os

path = 'pdfs/'   # folder holding your PDF files; easiest to put it next to your Jupyter notebooks
for file in os.listdir(path):
    pdfFileObject = open(os.path.join(path, file), 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    count = pdfReader.numPages
    for i in range(count):
        page = pdfReader.getPage(i)
        print(page.extractText())

@lullabymia
Author

Clicking the PDF link worked, thank you! But the error "Multiple definitions in dictionary at byte 0xcd34 for key /MediaBox" appeared again while reading the PDF files. :(

@ChicoXYC

@ChicoXYC
Collaborator

@lullabymia

  1. Check whether only PDF files are in the folder (the sketch below skips anything else).
  2. Please don't leave blank spaces in the folder name.
  3. If it still doesn't work, you can find me in CVA 808 and we can work it out face to face.
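
A sketch of point 1 combined with the earlier reading loop, assuming the folder may contain stray non-PDF files (such as .caj downloads) and that the /MediaBox message comes from PyPDF2's strict mode:

import PyPDF2
import os

path = 'pdfs/'
for file in os.listdir(path):
    # skip anything that is not a PDF, e.g. .caj downloads or hidden files
    if not file.lower().endswith('.pdf'):
        continue
    with open(os.path.join(path, file), 'rb') as pdfFileObject:
        # strict=False downgrades problems like duplicate /MediaBox keys to warnings
        pdfReader = PyPDF2.PdfFileReader(pdfFileObject, strict=False)
        for i in range(pdfReader.numPages):
            print(pdfReader.getPage(i).extractText())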

@hupili
Owner

hupili commented Dec 14, 2018

redirect to #135

@ChicoXYC ChicoXYC added the discussion Extra attention is needed label Jan 15, 2019