Forum OpenACS Development: Comments within cr_check_mime_type

Posted by Iuri Sampaio on
Yesterday, while I was having fun debugging ad_procs, I found cr_check_mime_type and some interesting comments within it.

# TODO: we use only the extension to get the mimetype. Something
# better should be done, like inspecting the actual content of the
# file and never trust the user on this regard, but as this
# involves changes also in the data model, we leave this for the
# future. Usages of this proc in the systems are already set to
# give us the path to the file here.

api-doc/proc-view?proc=cr_check_mime_type&source_p=1

I always had the same feeling, but I had always postponed further analysis of this subject until now. I tried to write a Tcl snippet to inspect what's within the file.

But it needed too much customization, and sometimes even files with the same extension (say, PDF files) had a different structure if they came from different sources. For each of them, I needed to customize something.

For example, .docx files exported to PDF using PDFCreator, and other files exported by MS Office's own converter.

In another case, the Tcl snippet scanned PNG images, one exported from Adobe Photoshop and the other created by the screenshot feature on a Mac. They had different structures too. Only one had PNG within its first line.

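# Note: [file type] reports the kind of filesystem entry (file, directory, link...), not a MIME type
set file_type [file type ${file.tmpfile}]
ns_log Notice "TYPE $file_type"
set file_extension [file extension ${file.tmpfile}]
ns_log Notice "FILE EXT $file_extension"

# Open in binary mode so the bytes are not changed by encoding or end-of-line translation
set fl [open ${file.tmpfile} rb]
# [gets] with a variable name returns the character count; the line itself is stored in $line
set f_count [gets $fl line]
ns_log Notice "LINE $line"
# Read the rest of the file, then release the channel
set data [read $fl]
close $fl
ns_log Notice "FILE \n $data"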
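
As a point of comparison for the line-based inspection above: every valid PNG file starts with the same eight-byte signature, and PDF files start with "%PDF-", so checking the leading bytes in binary mode tends to be more stable across exporting tools than looking at the first text line. The following is only a minimal sketch; the proc name sniff_mime_type is hypothetical (it is not part of OpenACS) and it covers just these two formats.

```tcl
# Hypothetical helper (not an OpenACS API): guess a MIME type from the
# leading bytes of a file. Only PNG and PDF are covered in this sketch.
proc sniff_mime_type {path} {
    # Binary mode keeps the bytes exactly as they are stored on disk.
    set fd [open $path rb]
    set head [read $fd 8]
    close $fd

    # PNG signature: 89 50 4E 47 0D 0A 1A 0A
    if {[string equal $head "\x89PNG\r\n\x1a\n"]} {
        return image/png
    }
    # PDF header: "%PDF-" followed by the version number
    if {[string match "%PDF-*" $head]} {
        return application/pdf
    }
    # Unknown content: let the caller decide what to do
    return ""
}
```

It could be called as [sniff_mime_type ${file.tmpfile}] and the result compared with the type derived from the extension.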

There's some interesting code written for images. Thanks, Dave!
Is there anything to inspect/scan PDF files?

api-doc/proc-view?proc=image::identify_binary&source_p=1
and
api-doc/proc-view?proc=image::imagemagick_identify&source_p=1

Posted by Antonio Pisano on
Hi Iuri,

Some approaches available now that you might consider for file type detection are:
- the available Tcl API (e.g. [1])
- wrapping the "file" command-line utility (on Unix-like systems).
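
A rough sketch of both options, assuming tcllib is installed for the first and a Unix "file" utility that supports the -b and --mime-type options for the second. The proc names detect_with_tcllib and detect_with_file_cmd are hypothetical and not part of OpenACS.

```tcl
# Hypothetical helper: content-based detection through tcllib.
# fileutil::fileType returns a list of descriptive words, not a MIME type.
proc detect_with_tcllib {path} {
    package require fileutil
    return [fileutil::fileType $path]
}

# Hypothetical helper: ask the Unix "file" utility for a MIME type.
# Returns an empty string when the utility is missing or reports an error.
proc detect_with_file_cmd {path} {
    if {[catch {exec file -b --mime-type $path} result]} {
        return ""
    }
    return [string trim $result]
}
```

The second variant costs an exec per call, so it is probably better suited to upload time than to every request.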

Keep in mind that, although stricter, checking the file type by its content is much more expensive than the lazy, extension-based approach (it requires file I/O, sometimes an exec...). Also, the more "type specific" you want the recognition to be (e.g. number of pages in a PDF, width of a PNG...), the more likely you will need a special tool/lib for this.

About PDFs, in [1] you can see they are also recognized (but you should check with different variants). If you need some further content inspection, the pdfinfo command from poppler-utils works quite well, and we also have some wrapping for this in [2].
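
To give an idea of what such a wrapper involves, here is a minimal sketch that calls pdfinfo directly and parses its "Key: value" output into a dict. It is not the code behind [2]; the proc names parse_pdfinfo_output and pdf_info are hypothetical, and the sketch assumes pdfinfo is on the PATH.

```tcl
# Hypothetical helper: turn "Key: value" lines, as printed by pdfinfo,
# into a dict. Lines without a colon are skipped.
proc parse_pdfinfo_output {text} {
    set info [dict create]
    foreach line [split $text \n] {
        # The key is everything before the first colon.
        if {[regexp {^([^:]+):\s*(.*)$} $line -> key value]} {
            dict set info [string trim $key] [string trim $value]
        }
    }
    return $info
}

# Hypothetical helper: run pdfinfo (poppler-utils) on a file.
# exec raises an error if pdfinfo is missing, exits non-zero,
# or writes anything to stderr.
proc pdf_info {path} {
    return [parse_pdfinfo_output [exec pdfinfo $path]]
}
```

With this, something like [dict get [pdf_info $path] Pages] would give the page count, provided pdfinfo reports a "Pages" entry for that file.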

[1] - https://core.tcl.tk/tcllib/doc/tcllib-1-18/embedded/www/tcllib/files/modules/fileutil/fileutil.html#11
[2] - https://openacs.org/api-doc/proc-view?proc=util::pdfinfo&source_p=1

Posted by Iuri Sampaio on
Thanks, Antonio.
That's precisely the information I need. I'm aware that, depending on the feature, a third-party application would be required. Plus, the extra I/O would mean a performance decrease, and so on.

A good example is what I've seen from Dave, in the scenarios where ImageMagick has been applied.

The main idea behind these daydreams is that we sometimes deal with files uploaded by users, which the system then provides to other users. This means my system could potentially harm or infect another computer if malicious, or even unaware, members upload files containing viruses, malicious code, macros, etc.

Currently, NGINX is blocking most of the dangerous ones, and the rest I have left in the hands of the OACS gods!

I know there is still a ton to be implemented. The post came from a very good coincidence between what I have experienced while writing code and the comments that I found within ad_proc cr_check_mime_type.

Once in a while, I get stuck troubleshooting the fundamentals, and I decide to step back and brush up a bit instead of rushing things.