(This article assumes that the reader is familiar with Linux, the "foremost" program, and data rescue in general.)
In Linux, the dd, ddrescue, dd_rescue, and/or dd_rhelp commands can be used to create image files of a hard drive or other storage medium. Once the image of the drive is acquired, the process of file carving can be executed. It's worth mentioning at this point that file carving programs can be used on live drives too. But most people use file carving on image files of drives. Also, it's good to know that even though the Linux operating system (OS) is used, the drive whose image is being scanned can be from OSes other than Linux. For example, "foremost" is very good at finding files on FAT12, FAT16, FAT32, NTFS, and other file systems.
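As a rough sketch of the acquisition and carving steps just described, the commands might look like the following. The device name /dev/sdX, the image name drive.img, and the output directory carved are placeholders rather than values from this article, and the "run" helper with its DRY_RUN variable is only an illustrative convenience so the commands can be printed and reviewed before anything touches a disk.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# image_and_carve DEVICE IMAGE OUTDIR
# Copies DEVICE into IMAGE with GNU ddrescue, then carves jpg files
# from the image with foremost. All three arguments are placeholders.
image_and_carve() {
    device=$1
    image=$2
    outdir=$3
    # ddrescue keeps a map file so an interrupted copy can be resumed.
    run ddrescue "$device" "$image" "$image.map" || return 1
    # -t selects the file types, -i the input image, -o the output directory.
    run foremost -t jpg -i "$image" -o "$outdir"
}
```

With DRY_RUN set to 1, calling "image_and_carve /dev/sdX drive.img carved" only prints the two commands. Unset the variable to actually run them, and double-check the device name first.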
File carving programs inspect patterns of information in a drive image file and attempt to reconstruct lost files (as well as files that are not lost, depending on the program settings chosen). "foremost" creates thirty-two subdirectories, one for each of the types of files it recognizes, and puts the recognized files in the appropriate folders.
No file carving program is perfect in recognizing files. Sometimes data sequences are detected as valid file headers when they are not. This causes a significant number of false positive detections.
I use "foremost" for file carving, and in particular for finding lost jpeg graphic image files. It works well, but it does give a fair number of these false positives. These are pieces of files that have proper headers, but which are not really viable graphic images, often because parts of the file were overwritten by other files. (Note that "foremost" seems not to recognize "tif" or "tiff" graphic image files by default, although it does recognize jpg, png, bmp, and gif graphic images. The file "/etc/foremost.conf" does have a line that can be used to enable "tif" recognition.)
In a file recovery project on a Windows hard drive it would not be unusual to find tens of thousands of prospective graphic images, of which maybe 10 per cent or more are actually not viewable graphic images. That is not to say that all the viewable files are interesting files. Some jpeg files will be graphic images from web site graphics, and will not be of interest to anyone.
I wrote a set of scripts that filters out many bad files (bad in that they are not viewable graphic images). The scripts do not, however, distinguish graphic images of interest from uninteresting graphic images. They only distinguish displayable graphic images from files that were false positives.
The basis of the script is the ImageMagick suite, specifically the program "identify". "identify" inspects a file and reports back certain characteristics. If the file it is inspecting is not a valid graphic image file, "identify" provides a non-zero return code to the shell. This return code can be inspected by checking the shell variable $? . If the return code value is non-zero, the file can be deleted or moved to a different directory / folder. In the case of my scripts, the file is deleted.
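The paragraph above mentions moving rejected files as an alternative to deleting them. A minimal sketch of that variant follows. The function name move_if_invalid and the directory name rejected are made up for illustration and are not part of the scripts described in this article.

```shell
# move_if_invalid FILE [DIR]
# Moves FILE into DIR (default: ./rejected) when "identify" fails on it.
move_if_invalid() {
    file=$1
    dir=${2:-rejected}
    # Discard identify's report; only the return code matters here.
    if ! identify "$file" >/dev/null 2>&1; then
        mkdir -p "$dir" && mv -- "$file" "$dir"/
    fi
}
```

Called as "move_if_invalid picture.jpg", a file that "identify" rejects ends up in ./rejected, where it can still be inspected before it is thrown away.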
See the warning below. The second script, "chkdel", should not be run from the command line: run with the wrong parameters, it can delete files you wanted to keep, because any file that is not a recognized graphic image file will be deleted. Use the first script, "deleter", with proper wildcards to limit the type of file to be tested and potentially deleted.
I implemented the scripts in two separate files. One I call "deleter". It simply runs a "for" loop on all the jpeg files in the current directory. The action of the loop calls another script file "chkdel" that invokes "identify", checks the return value, and either does nothing (if the file is a valid graphics image file) or deletes the file (if it's not a valid graphics image file).
The "deleter" script:
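# Test every jpg file in the current directory.
for files in *.jpg
do
    # Quote the name so spaces or odd characters are passed intact.
    . ./chkdel "$files"
done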
The "chkdel" script:
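identify "$1"    # reports the image details; non-zero return code if invalid
if [ $? -gt 0 ]
then
    rm -- "$1"   # delete the file that "identify" rejected
fi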
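Both scripts should have the "execute" attribute set so they can be run. (When a script is sourced with a leading dot, as in the invocation shown later, the shell only needs read permission, but setting the attribute does no harm.) Use the "chmod" command:
chmod u+x deleter
and
chmod u+x chkdel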
WARNING: If the "chkdel" script is run from the command line with the wrong parameters (for example "*" or "*.*") any file that is not a valid image file will be deleted. Another kind of error would be to call "chkdel" with a specification like "*.doc". That would cause all "doc" files to be deleted because they are not valid graphics image files.
For this reason, you should not use "chkdel" from the command line unless you are willing to lose files needlessly. Instead, use "deleter" after changing the file specification from "*.jpg" to the kind of file you want to test. Remember that files will be tested to see if they are valid graphics image files, not to see if they are valid files of any sort.
You have been warned.
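Given the risk described in the warning, it may be worth previewing what would be removed before running the scripts for real. The sketch below only lists the files that "identify" rejects and deletes nothing. The function name list_invalid is hypothetical.

```shell
# list_invalid FILE...
# Prints the name of every FILE that "identify" does not accept.
# Nothing is deleted or moved.
list_invalid() {
    for f in "$@"; do
        # Skip unmatched wildcard patterns and anything that is not a regular file.
        [ -f "$f" ] || continue
        identify "$f" >/dev/null 2>&1 || echo "$f"
    done
}
```

For example, "list_invalid *.jpg" prints one rejected file name per line. If the list looks wrong (say, it contains documents, not broken pictures), the wildcard needs fixing before any deletion is attempted.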
The scripts are invoked:
. ./deleter
That is "dot space dot slash deleter". The leading dot is shorthand for the "source" command. The second dot and the slash designate that the file "deleter" is in the current directory.
Remember that "deleter" calls the "chkdel" script, so you won't actually have to run the "chkdel" command from the shell command line yourself. When called from "deleter" the first line of "chkdel" invokes "identify" and reports back the size of the image, if it's a valid image. If the file is not a valid image "identify" returns a non-zero return code.
Note that in the second line of "chkdel", there must be spaces before and after both of the brackets. If these spaces are omitted, the script will fail.
The "if" statement tests $?, which is the return code of the most recently executed command ("identify" in this case). If the return code was greater than zero, indicating that the file was not a valid graphic image file, the "then" branch runs. It simply deletes the file passed to the "chkdel" script.
Once the scripts have finished, any files that "identify" couldn't identify are gone. What's left in the folder should be images that can be displayed.
This is not a foolproof way of removing all junk jpeg files. Some files may have parts of the image missing, even though the file itself is in a legitimate jpeg format. At this point a human must view the files and filter out undesirable images.
The "identify" command understands a variety of graphic image types in addition to jpeg files. The scripts can be copied to the "bmp", "gif", and "png" directories and run there, as long as the "*.jpg" specification in the "deleter" script is changed to match the file type ("*.bmp" in the "bmp" directory, "*.gif" in the "gif" directory, and so on). If the default "*.jpg" specification is left in place, nothing will happen in those directories, because "deleter" selects files by file extension (unlike the "foremost" command, which examines file contents). See the warning above about the dire consequences of running these scripts on files that are not graphic images. This set of scripts might also be adaptable to kinds of files other than image files.
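Editing the file specification by hand for each directory is easy to get wrong, so one possible alternative is to pass the extension as a parameter. The following is only a sketch of that idea. The function name delete_invalid_by_ext is invented here, and, like the original scripts, it deletes whatever "identify" rejects.

```shell
# delete_invalid_by_ext EXT
# Tests every *.EXT file in the current directory with "identify"
# and deletes the ones it rejects.
delete_invalid_by_ext() {
    ext=$1
    [ -n "$ext" ] || { echo "usage: delete_invalid_by_ext EXT" >&2; return 2; }
    for f in ./*."$ext"; do
        # Skip the literal pattern when no file matches.
        [ -f "$f" ] || continue
        identify "$f" >/dev/null 2>&1 || rm -- "$f"
    done
}
```

From inside the "gif" directory this would be called as "delete_invalid_by_ext gif". Files with other extensions in the same directory are never examined.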
The version of "foremost" I have been using can recognize 32 different file types by default. In the chkdel script, the "identify" command could be replaced with a different command, one that could test the validity of other kinds of files. For example, if there is a program that can check the contents of a Word file and return a non-zero status if the file is not really a valid Word file, while returning a zero status for a valid Word file, then that command could be used in place of the "identify" command to test the validity of files originally detected as Word files.
Of course, in that case the "deleter" command would need to be changed to match the file type of interest.
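To illustrate swapping in a different checking command, here is a sketch in which the checker is passed as the first argument. The names delete_if_rejected and check_zip are hypothetical, and zip archives are used only as an example of a file type with a readily available test ("unzip -t" checks archive integrity and returns a non-zero status on failure). No particular checker for Word files is assumed.

```shell
# delete_if_rejected CHECK_COMMAND FILE...
# Runs CHECK_COMMAND on each FILE and deletes the file when the
# command returns a non-zero status.
delete_if_rejected() {
    check=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] || continue
        "$check" "$f" >/dev/null 2>&1 || rm -- "$f"
    done
}

# Example checker for zip archives; -t tests the archive, -q keeps it quiet.
check_zip() {
    unzip -tq "$1"
}
```

Assuming carved zip files sit in the current directory, "delete_if_rejected check_zip *.zip" would remove the ones that fail the archive test. The same warning applies as before: whatever the checker rejects is deleted, so the wildcard must match only the intended file type.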
For drives or drive image files where there was much deletion of files over a long period of time, the proportion of false positives will be higher than it would be for a relatively newer drive. The reason for this is that over time, more files will have been deleted and partially overwritten on a drive that has been used a lot. This set of scripts can help make the job of finding lost files a little easier.
Enjoy!